Update to Hugging Face standard model format
- .gitattributes +2 -0
- LICENSE +0 -21
- MODEL_CARD.md +0 -164
- README.md +107 -158
- models/mini/config.json → config.json +0 -0
- data/sample_data.jsonl +0 -10
- demo.py +0 -510
- examples/basic_usage.py +0 -85
- examples/clustering.py +0 -109
- examples/semantic_search.py +0 -108
- models/mini/model.pt → model.pt +0 -0
- models/mini/model.safetensors → model.safetensors +0 -0
- models/large/README.md +0 -5
- models/medium/README.md +0 -5
- models/product/README.md +0 -5
- models/small/README.md +0 -5
- requirements.txt +0 -14
- src/__pycache__/__init__.cpython-313.pyc +0 -0
- src/__pycache__/inference.cpython-313.pyc +0 -0
- src/__pycache__/model.cpython-313.pyc +0 -0
- src/__pycache__/tokenizer.cpython-313.pyc +0 -0
- models/mini/tokenizer.json → tokenizer.json +0 -0
- models/mini/training_info.json → training_info.json +0 -0
.gitattributes CHANGED
@@ -1,2 +1,4 @@
 models/mini/model.pt filter=lfs diff=lfs merge=lfs -text
 models/mini/model.safetensors filter=lfs diff=lfs merge=lfs -text
+model.pt filter=lfs diff=lfs merge=lfs -text
+model.safetensors filter=lfs diff=lfs merge=lfs -text
LICENSE DELETED
@@ -1,21 +0,0 @@
-MIT License
-
-Copyright (c) 2024
-
-Permission is hereby granted, free of charge, to any person obtaining a copy
-of this software and associated documentation files (the "Software"), to deal
-in the Software without restriction, including without limitation the rights
-to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-copies of the Software, and to permit persons to whom the Software is
-furnished to do so, subject to the following conditions:
-
-The above copyright notice and this permission notice shall be included in all
-copies or substantial portions of the Software.
-
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-SOFTWARE.
MODEL_CARD.md DELETED
@@ -1,164 +0,0 @@
----
-language: en
-license: mit
-tags:
-- text-embedding
-- sentence-similarity
-- semantic-search
-- product-matching
-- transformer
-- pytorch
-- from-scratch
-library_name: pytorch
-pipeline_tag: sentence-similarity
-model-index:
-- name: MiniEmbed-Mini
-  results: []
----
-
-# MiniEmbed: Tiny, Powerful Embedding Models from Scratch
-
-**MiniEmbed** is an ultra-compact text embedding model (Bi-Encoder) built entirely from scratch in PyTorch. No HuggingFace Transformers, no pre-trained weights -- just pure PyTorch.
-
-**GitHub:** [github.com/bhandarisuraz/miniembed](https://github.com/bhandarisuraz/miniembed) (full repo with examples, tests, interactive demo, and documentation)
-
-| Spec | Value |
-|---|---|
-| Parameters | ~10.8M |
-| Model Size | ~42 MB |
-| Embedding Dim | 256 |
-| Vocab Size | 30,000 |
-| Max Seq Length | 128 tokens |
-| Architecture | 4-layer Transformer Encoder |
-| Pooling | Mean Pooling + L2 Normalization |
-| Training Loss | MNRL (Multiple Negatives Ranking Loss) |
-| Training Data | ~3.8M pairs (NQ, GooAQ, MSMARCO, WDC, ECInstruct) |
-
-## Quick Start
-
-```bash
-pip install torch numpy scikit-learn huggingface_hub
-```
-
-```python
-from huggingface_hub import snapshot_download
-
-# Download model (one-time)
-model_dir = snapshot_download("surazbhandari/miniembed")
-
-# Add src to path
-import sys
-sys.path.insert(0, model_dir)
-
-from src.inference import EmbeddingInference
-
-# Load -- just like sentence-transformers!
-model = EmbeddingInference.from_pretrained("surazbhandari/miniembed")
-
-# Similarity
-score = model.similarity("Machine learning is great", "AI is wonderful")
-print(f"Similarity: {score:.4f}")  # 0.4287
-
-# Semantic Search
-docs = ["Python is great for AI", "I love pizza", "Neural networks learn patterns"]
-results = model.search("deep learning frameworks", docs, top_k=2)
-for r in results:
-    print(f"  [{r['score']:.3f}] {r['text']}")
-# [0.498] Neural networks learn patterns
-# [0.413] Python is great for AI
-
-# Clustering
-result = model.cluster_texts(["ML is cool", "Pizza is food", "AI rocks"], n_clusters=2)
-# Cluster 1: ['Pizza is food']
-# Cluster 2: ['ML is cool', 'AI rocks']
-```
-
-## Also Available via GitHub
-
-```bash
-git clone https://github.com/bhandarisuraz/miniembed.git
-cd miniembed
-pip install -r requirements.txt
-
-python -c "
-from src.inference import EmbeddingInference
-model = EmbeddingInference.from_pretrained('models/mini')
-print(model.similarity('hello world', 'hi there'))
-"
-```
-
-## Capabilities
-
-- **Semantic Search** -- Find meaning-based matches, not keyword overlap.
-- **Re-Ranking** -- Sort candidates by true semantic relevance.
-- **Clustering** -- Group texts into logical categories automatically.
-- **Product Matching** -- Match items across platforms with messy titles.
-
-## Architecture
-
-Custom 4-layer Transformer encoder built from first principles:
-
-- Token Embedding (30K vocab) + Sinusoidal Positional Encoding
-- 4x Pre-LayerNorm Transformer Encoder Layers
-- Multi-Head Self-Attention (4 heads, d_k=64)
-- Position-wise Feed-Forward (GELU activation, d_ff=1024)
-- Mean Pooling over non-padded tokens
-- L2 Normalization (unit hypersphere projection)
-
-## Training
-
-Trained on ~3.8 million text pairs from public datasets:
-
-| Dataset | Type |
-|---|---|
-| Natural Questions (NQ) | Q&A / General |
-| GooAQ | Knowledge Search |
-| WDC Product Matching | E-commerce |
-| ECInstruct | E-commerce Tasks |
-| MS MARCO | Web Search |
-
-**Training details:**
-- Training time: ~49 hours
-- Final loss: 0.0748
-- Optimizer: AdamW
-- Batch size: 256
-
-## Files
-
-```
-surazbhandari/miniembed
-|-- README.md            # This model card
-|-- config.json          # Architecture config
-|-- model.safetensors    # Pre-trained weights (Safe & Fast)
-|-- model.pt             # Pre-trained weights (Legacy PyTorch)
-|-- tokenizer.json       # 30K word-level vocabulary
-|-- training_info.json   # Training metadata
-|-- src/
-    |-- __init__.py
-    |-- model.py         # Full architecture code
-    |-- tokenizer.py     # Tokenizer implementation
-    |-- inference.py     # High-level API (supports HF auto-download)
-```
-
-## Limitations
-
-- Word-level tokenizer (no subword/BPE) -- unknown words map to [UNK]
-- 128 token max sequence length
-- Trained primarily on English text
-- Best suited for short-form text (queries, product titles, sentences)
-
-## Citation
-
-```bibtex
-@software{Bhandari_MiniEmbed_2026,
-  author = {Bhandari, Suraj},
-  title = {{MiniEmbed: Tiny, Powerful Embedding Models from Scratch}},
-  url = {https://github.com/bhandarisuraz/miniembed},
-  version = {1.0.0},
-  year = {2026}
-}
-```
-
-## License
-
-MIT
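The deleted model card describes the pooling head as mean pooling over non-padded tokens followed by L2 normalization. As a minimal NumPy sketch of that idea — the function name, shapes, and toy inputs here are illustrative assumptions, not the repository's actual code:

```python
import numpy as np

def pool_and_normalize(token_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Mean-pool token states over non-padded positions, then L2-normalize.

    token_states: (batch, seq_len, dim) hidden states from the encoder.
    attention_mask: (batch, seq_len) with 1 for real tokens, 0 for padding.
    """
    mask = attention_mask[:, :, None].astype(token_states.dtype)   # (batch, seq, 1)
    summed = (token_states * mask).sum(axis=1)                     # sum only real tokens
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                 # avoid divide-by-zero
    pooled = summed / counts                                       # mean pooling
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.clip(norms, 1e-9, None)                     # project to unit hypersphere

states = np.random.rand(2, 4, 8)                 # 2 texts, 4 tokens, dim 8
mask = np.array([[1, 1, 1, 0], [1, 1, 0, 0]])    # trailing positions are padding
emb = pool_and_normalize(states, mask)
print(np.linalg.norm(emb, axis=1))               # each row has unit norm
```

After this normalization, the dot product of two embeddings equals their cosine similarity, which is why the card's `similarity` scores can be computed as plain dot products.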
README.md CHANGED
@@ -1,206 +1,159 @@
-# MiniEmbed: Tiny, Powerful Embedding Models from Scratch
-
-**MiniEmbed** is a research-grade toolkit for training and deploying ultra-compact text embedding models (Bi-Encoders) built entirely from scratch in PyTorch. While the industry chases billion-parameter giants, MiniEmbed proves that a **~42 MB / 10.8M parameter** model can deliver production-grade semantic intelligence for specialized domains.
-
-[](LICENSE)
-[](https://python.org)
-[](https://pytorch.org)
-[](https://huggingface.co/surazbhandari/miniembed)
-
 ---
+language: en
+license: mit
+tags:
+- text-embedding
+- sentence-similarity
+- semantic-search
+- product-matching
+- transformer
+- pytorch
+- from-scratch
+library_name: pytorch
+pipeline_tag: sentence-similarity
+model-index:
+- name: MiniEmbed-Mini
+  results: []
 ---
 
+# MiniEmbed: Tiny, Powerful Embedding Models from Scratch
+
+**MiniEmbed** is an ultra-compact text embedding model (Bi-Encoder) built entirely from scratch in PyTorch. No HuggingFace Transformers, no pre-trained weights -- just pure PyTorch.
+
+**GitHub:** [github.com/bhandarisuraz/miniembed](https://github.com/bhandarisuraz/miniembed) (full repo with examples, tests, interactive demo, and documentation)
+
+| Spec | Value |
+|---|---|
+| Parameters | ~10.8M |
+| Model Size | ~42 MB |
+| Embedding Dim | 256 |
+| Vocab Size | 30,000 |
+| Max Seq Length | 128 tokens |
+| Architecture | 4-layer Transformer Encoder |
+| Pooling | Mean Pooling + L2 Normalization |
+| Training Loss | MNRL (Multiple Negatives Ranking Loss) |
+| Training Data | ~3.8M pairs (NQ, GooAQ, MSMARCO, WDC, ECInstruct) |
-miniembed/
-|-- README.md # You are here
-|-- LICENSE # MIT License
-|-- requirements.txt # Python dependencies
-|-- demo.py # Interactive Streamlit demo
-|-- src/ # Core library
-| |-- __init__.py
-| |-- model.py # Transformer architecture (from scratch)
-| |-- tokenizer.py # Custom word-level tokenizer
-| |-- inference.py # High-level API for encoding & search
-|-- models/
-| |-- mini/ # Pre-trained Mini model
-| |-- model.safetensors # Pre-trained weights (Safe & Fast)
-| |-- model.pt # Pre-trained weights (Legacy)
-| |-- config.json # Architecture blueprint
-| |-- tokenizer.json # 30K vocabulary
-| |-- training_info.json # Training metadata
-|-- examples/ # Ready-to-run scripts
-| |-- basic_usage.py # Encoding & similarity
-| |-- semantic_search.py # Document retrieval
-| |-- clustering.py # Text clustering with K-Means
-|-- data/
-|-- sample_data.jsonl # 10-pair demo dataset
-```
 
 ## Quick Start
 
-### 1. Install Dependencies
 ```bash
-cd miniembed
-pip install -r requirements.txt
+pip install torch numpy scikit-learn huggingface_hub
 ```
 
-### 2. Use the Model
-
-The pre-trained Mini model is included in `models/mini/`. Alternatively, you can load it directly from Hugging Face:
-
 ```python
-from
+from huggingface_hub import snapshot_download
 
-#
+# Download model (one-time)
+model_dir = snapshot_download("surazbhandari/miniembed")
 
-#
+# Add src to path
+import sys
+sys.path.insert(0, model_dir)
 
-### 3. Try It Instantly
-```python
 from src.inference import EmbeddingInference
 
+# Load -- just like sentence-transformers!
+model = EmbeddingInference.from_pretrained("surazbhandari/miniembed")
 
-# Similarity
+# 1. Similarity
 score = model.similarity("Machine learning is great", "AI is wonderful")
 print(f"Similarity: {score:.4f}") # 0.4287
 
+# 2. Normal Embeddings
+embeddings = model.encode(["Machine learning is great", "AI is wonderful"])
+import numpy as np
+manual_score = np.dot(embeddings[0], embeddings[1]) # Dot product = Cosine Similarity
+
-#
+# 3. Semantic Search
 docs = ["Python is great for AI", "I love pizza", "Neural networks learn patterns"]
 results = model.search("deep learning frameworks", docs, top_k=2)
 for r in results:
     print(f" [{r['score']:.3f}] {r['text']}")
+# [0.498] Neural networks learn patterns
+# [0.413] Python is great for AI
 
+# 4. Clustering
+result = model.cluster_texts(["ML is cool", "Pizza is food", "AI rocks"], n_clusters=2)
+# Cluster 1: ['Pizza is food']
+# Cluster 2: ['ML is cool', 'AI rocks']
 ```
 
+## Also Available via GitHub
-## Interactive Demo (`demo.py`)
-
-A full-featured Streamlit dashboard for exploring the model's capabilities without writing code:
-
-- **Similarity** -- Real-time cosine similarity between any two texts.
-- **Semantic Search** -- Rank a custom document set against your query.
-- **Clustering** -- Automatically categorize items using K-Means.
-- **Text Encoding** -- Inspect raw 256-D vectors and their statistics.
-- **CSV Matcher** -- Match records between two CSV files for deduplication or cross-platform product mapping.
 
 ```bash
+git clone https://github.com/bhandarisuraz/miniembed.git
+cd miniembed
+pip install -r requirements.txt
 
+python -c "
+from src.inference import EmbeddingInference
+model = EmbeddingInference.from_pretrained('models/mini')
+print(model.similarity('hello world', 'hi there'))
+"
+```
-
----
-
-## Architecture
-
-MiniEmbed uses a **custom 4-layer Transformer encoder** built from scratch -- no HuggingFace, no pre-trained weights:
-
-| Component | Specification |
-|---|---|
-| Embedding Dimension | 256 |
-| Attention Heads | 4 |
-| Transformer Layers | 4 |
-| Feed-Forward Dimension | 1,024 |
-| Vocabulary Size | 30,000 |
-| Max Sequence Length | 128 tokens |
-| Total Parameters | ~10.8M |
-| Model Size on Disk | ~42 MB |
-| Pooling Strategy | Mean Pooling + L2 Normalization |
-
-### Training Objective
-
-Training uses **Multiple Negatives Ranking Loss (MNRL)**, the industry-standard contrastive objective for Bi-Encoders:
-
-$$\mathcal{L} = -\sum_{i=1}^{n} \log \frac{e^{sim(q_i, p_i) / \tau}}{\sum_{j=1}^{n} e^{sim(q_i, p_j) / \tau}}$$
-
-All embeddings are **L2-normalized**, projecting text onto a unit hypersphere where cosine similarity equals dot product -- enabling ultra-fast nearest-neighbor search.
 
+## Capabilities
-##
 
+- **Semantic Search** -- Find meaning-based matches, not keyword overlap.
+- **Re-Ranking** -- Sort candidates by true semantic relevance.
+- **Clustering** -- Group texts into logical categories automatically.
+- **Product Matching** -- Match items across platforms with messy titles.
 
+## Architecture
+
+Custom 4-layer Transformer encoder built from first principles:
+
+- Token Embedding (30K vocab) + Sinusoidal Positional Encoding
+- 4x Pre-LayerNorm Transformer Encoder Layers
+- Multi-Head Self-Attention (4 heads, d_k=64)
+- Position-wise Feed-Forward (GELU activation, d_ff=1024)
+- Mean Pooling over non-padded tokens
+- L2 Normalization (unit hypersphere projection)
+
+## Training
+
+Trained on ~3.8 million text pairs from public datasets:
+
+| Dataset | Type |
 |---|---|
-|---|---|---|
-| **Natural Questions (NQ)** | Q&A / General | [HuggingFace](https://huggingface.co/datasets/google-research-datasets/natural_questions) |
-| **GooAQ** | Knowledge Search | [HuggingFace](https://huggingface.co/datasets/sentence-transformers/gooaq) |
-| **WDC Product Matching** | E-commerce | [HuggingFace](https://huggingface.co/datasets/wdc/products-2017) |
-| **ECInstruct** | E-commerce Tasks | [HuggingFace](https://huggingface.co/datasets/NingLab/ECInstruct) |
-| **MS MARCO** | Web Search | [HuggingFace](https://huggingface.co/datasets/microsoft/ms_marco) |
+| Natural Questions (NQ) | Q&A / General |
+| GooAQ | Knowledge Search |
+| WDC Product Matching | E-commerce |
+| ECInstruct | E-commerce Tasks |
+| MS MARCO | Web Search |
 
-##
+**Training details:**
+- Training time: ~49 hours
+- Final loss: 0.0748
+- Optimizer: AdamW
+- Batch size: 256
 
----
-
-## Examples
-
-cd examples
-
-# Basic encoding and similarity
-python basic_usage.py
-
-#
+## Files
+
+```
+surazbhandari/miniembed
+|-- README.md # This model card
+|-- config.json # Architecture config
+|-- model.safetensors # Pre-trained weights (Safe & Fast)
+|-- model.pt # Pre-trained weights (Legacy PyTorch)
+|-- tokenizer.json # 30K word-level vocabulary
+|-- training_info.json # Training metadata
+|-- src/
+    |-- __init__.py
+    |-- model.py # Full architecture code
+    |-- tokenizer.py # Tokenizer implementation
+    |-- inference.py # High-level API (supports HF auto-download)
 ```
 
-## Roadmap
-
-- **mini-product** -- A further fine-tuned version of the Mini model, specialized for high-accuracy **product matching** is Coming soon...
+## Limitations
 
+- Word-level tokenizer (no subword/BPE) -- unknown words map to [UNK]
+- 128 token max sequence length
+- Trained primarily on English text
+- Best suited for short-form text (queries, product titles, sentences)
 
 ## Citation
 
-If you use MiniEmbed in your research, please cite:
-
 ```bibtex
 @software{Bhandari_MiniEmbed_2026,
   author = {Bhandari, Suraj},
@@ -211,10 +164,6 @@ If you use MiniEmbed in your research, please cite:
 }
 ```
 
----
-
 ## License
 
-MIT
+MIT
-
-Explore, learn, and build smaller, smarter AI.
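The old README's Training Objective section, removed above, gives the MNRL formula $\mathcal{L} = -\sum_i \log \frac{e^{sim(q_i, p_i)/\tau}}{\sum_j e^{sim(q_i, p_j)/\tau}}$. A minimal NumPy sketch of that in-batch softmax loss — the function name, the temperature value `tau=0.05`, the toy embeddings, and averaging (rather than summing) over the batch are all illustrative assumptions, not the repository's training code:

```python
import numpy as np

def mnrl_loss(queries: np.ndarray, passages: np.ndarray, tau: float = 0.05) -> float:
    """Multiple Negatives Ranking Loss over one batch.

    queries, passages: (n, dim) L2-normalized embeddings; passage i is the
    positive for query i, and every passage j != i is an in-batch negative.
    """
    sims = queries @ passages.T / tau                 # (n, n) scaled cosine similarities
    sims -= sims.max(axis=1, keepdims=True)           # numerical stability before exp
    log_softmax = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))      # -log p(correct passage | query)

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
q /= np.linalg.norm(q, axis=1, keepdims=True)         # unit vectors, as after L2 norm
perfect = mnrl_loss(q, q)                             # each query's positive is itself
shuffled = mnrl_loss(q, q[::-1])                      # misaligned positives -> higher loss
print(perfect, shuffled)
```

The loss is minimized when each query is most similar to its own passage, which is exactly the ranking behavior the README's Quick Start examples rely on.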
models/mini/config.json → config.json RENAMED
File without changes
data/sample_data.jsonl DELETED
@@ -1,10 +0,0 @@
-{"query": "how to train an embedding model", "passage": "Training an embedding model involves using contrastive learning on query-passage pairs.", "source": "sample"}
-{"query": "what is a transformer", "passage": "The Transformer is a deep learning model that uses self-attention mechanisms to process sequence data.", "source": "sample"}
-{"query": "nike air max 90", "passage": "Men's Nike Air Max 90 Casual Shoes in Black and White.", "source": "sample"}
-{"query": "samsung galaxy s21", "passage": "Samsung Galaxy S21 5G 128GB Unlocked Smartphone - Phantom Gray.", "source": "sample"}
-{"query": "best winter coats", "passage": "The North Face Gotham Jacket III is one of the warmest winter parkas for heavy snow.", "source": "sample"}
-{"query": "python programming for beginners", "passage": "Learn Python with this comprehensive guide covering variables, loops, and functions.", "source": "sample"}
-{"query": "benefits of meditation", "passage": "Meditation can reduce stress, improve concentration, and increase happiness.", "source": "sample"}
-{"query": "how to bake chocolate cake", "passage": "Whisk eggs and sugar, then fold in flour and melted chocolate for a perfect moist cake.", "source": "sample"}
-{"query": "what is machine learning", "passage": "Machine learning is a field of AI that allows systems to learn patterns from data without explicit programming.", "source": "sample"}
-{"query": "running shoes for flat feet", "passage": "Brooks Adrenaline GTS 22 provides excellent stability and support for runners with low arches.", "source": "sample"}
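The deleted sample file is JSON Lines: one query–passage pair per line. A small stdlib-only sketch of reading that format — the in-memory buffer stands in for the file and reproduces two lines from the deleted data:

```python
import io
import json

# Stand-in for open("data/sample_data.jsonl"); contents copied from the deleted file.
sample = io.StringIO(
    '{"query": "what is a transformer", "passage": "The Transformer is a deep learning model that uses self-attention mechanisms to process sequence data.", "source": "sample"}\n'
    '{"query": "nike air max 90", "passage": "Men\'s Nike Air Max 90 Casual Shoes in Black and White.", "source": "sample"}\n'
)

# Parse one JSON object per non-empty line.
pairs = [json.loads(line) for line in sample if line.strip()]
queries = [p["query"] for p in pairs]
print(len(pairs), queries[0])
```

Each parsed pair supplies one (query, positive passage) example of the kind the MNRL objective trains on.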
demo.py DELETED
@@ -1,510 +0,0 @@
-"""
-MiniEmbed - Interactive Demo
-================================
-Explore the embedding model's capabilities through a Streamlit dashboard.
-
-Features:
-- Pairwise text similarity (cosine distance)
-- Semantic document search with ranked results
-- Unsupervised text clustering via K-Means
-- Raw embedding vector inspection and visualization
-- Bulk CSV-to-CSV record matching
-
-Run: streamlit run demo.py
-"""
-
-import streamlit as st
-import numpy as np
-import pandas as pd
-import os
-import sys
-import io
-
-# Add src to path
-sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
-
-from src.inference import EmbeddingInference, EmbeddingModelManager
-
-# ============================================================================
-# PAGE CONFIG
-# ============================================================================
-
-st.set_page_config(
-    page_title="MiniEmbed Demo",
-    page_icon="M",
-    layout="wide"
-)
-
-# Custom CSS
-st.markdown("""
-<style>
-    .main-header {
-        font-size: 2.5rem;
-        font-weight: 700;
-        background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
-        -webkit-background-clip: text;
-        -webkit-text-fill-color: transparent;
-        text-align: center;
-        margin-bottom: 1rem;
-    }
-    .sub-header {
-        text-align: center;
-        color: #888;
-        margin-bottom: 2rem;
-    }
-    .result-box {
-        background: rgba(100, 100, 100, 0.1);
-        border-radius: 10px;
-        padding: 1rem;
-        margin: 0.5rem 0;
-        color: inherit;
-    }
-    .high-score { border-left: 4px solid #28a745; background: rgba(40, 167, 69, 0.1); }
-    .medium-score { border-left: 4px solid #ffc107; background: rgba(255, 193, 7, 0.1); }
-    .low-score { border-left: 4px solid #dc3545; background: rgba(220, 53, 69, 0.1); }
-    .score-text { font-weight: bold; }
-</style>
-""", unsafe_allow_html=True)
-
-# ============================================================================
-# LOAD MODEL
-# ============================================================================
-
-@st.cache_resource
-def load_model(model_name):
-    """Load the embedding model from disk."""
-    model_dir = f"models/{model_name}"
-    if model_name == "Legacy (model/)":
-        model_dir = "model"
-    return EmbeddingInference.from_pretrained(model_dir)
-
-
-# Header
-st.markdown('<h1 class="main-header">MiniEmbed Demo</h1>', unsafe_allow_html=True)
-st.markdown('<p class="sub-header">Explore semantic similarity, search, clustering, and bulk matching</p>', unsafe_allow_html=True)
-
-# -----------------------------------------------------------------------------
-# Model Selection
-# -----------------------------------------------------------------------------
-available_models = EmbeddingModelManager.list_models()
-if os.path.exists("model/model.pt"):
-    available_models.append("Legacy (model/)")
-
-if not available_models:
-    st.error("No models found. Train a model first or place weights in models/mini/model.pt.")
-    st.info("Models should be located in the `models/` directory (e.g., `models/mini/`).")
-    st.stop()
-
-selected_model_name = st.sidebar.selectbox(
-    "Select Model",
-    available_models,
-    index=0,
-    help="Select which trained model to load for inference."
-)
-
-model = load_model(selected_model_name)
-
-if model is None:
-    st.error("Model not found. Please train the model first.")
-    st.stop()
-
-# Model info
-with st.expander("Model Info", expanded=False):
-    st.markdown("""
-    This panel shows the architecture of the currently loaded model.
-    - **Embedding Dim**: The size of each output vector (higher = more expressive).
-    - **Layers**: Number of Transformer encoder layers stacked in the model.
-    - **Vocab Size**: Total number of unique tokens the model can recognize.
-    """)
-    col1, col2, col3 = st.columns(3)
-    with col1:
-        st.metric("Embedding Dim", model.model.d_model)
-    with col2:
-        st.metric("Layers", len(model.model.layers))
-    with col3:
-        st.metric("Vocab Size", len(model.tokenizer.word_to_id))
-
-# ============================================================================
-# TABS
-# ============================================================================
-
-tab1, tab2, tab3, tab4, tab5 = st.tabs([
-    "Similarity",
-    "Semantic Search",
-    "Clustering",
-    "Encode Text",
-    "CSV Matcher"
-])
-
-# ============================================================================
-# TAB 1: SIMILARITY
-# ============================================================================
-
-with tab1:
-    st.markdown("### Pairwise Text Similarity")
-    st.markdown("""
-    Enter two texts to compute their **cosine similarity** (range: 0 to 1).
-    The model encodes each text into a 256-dimensional vector and measures
-    the angular distance between them. A score close to 1.0 means the texts
-    are semantically equivalent; a score near 0.0 means they are unrelated.
-    """)
-
-    col1, col2 = st.columns(2)
-
-    with col1:
-        text1 = st.text_area(
-            "Text 1",
-            "Machine learning is a branch of artificial intelligence",
-            height=100,
-            key="sim_text1"
-        )
-
-    with col2:
-        text2 = st.text_area(
-            "Text 2",
-            "AI systems can learn patterns from data",
-            height=100,
-            key="sim_text2"
-        )
-
-    if st.button("Compute Similarity", type="primary", key="sim_btn"):
-        if text1 and text2:
-            with st.spinner("Computing..."):
-                similarity = model.similarity(text1, text2)
-
-            if similarity > 0.7:
-                color = "#28a745"
-                label = "Very Similar"
-            elif similarity > 0.4:
-                color = "#ffc107"
-                label = "Somewhat Similar"
-            else:
-                color = "#dc3545"
-                label = "Not Similar"
-
-            st.markdown(f"""
-            <div style="text-align: center; padding: 2rem;">
-                <div style="font-size: 4rem; font-weight: bold; color: {color};">
-                    {similarity:.3f}
-                </div>
-                <div style="font-size: 1.2rem; color: {color};">
-                    {label}
-                </div>
-            </div>
-            """, unsafe_allow_html=True)
-
-    # Example pairs
-    st.markdown("---")
-    st.markdown("#### Example Pairs")
-    st.markdown("These pairs demonstrate how the model distinguishes related from unrelated content:")
|
| 200 |
-
|
| 201 |
-
examples = [
|
| 202 |
-
("Python is a programming language", "Java is used for software development"),
|
| 203 |
-
("The cat sat on the mat", "A feline rested on the rug"),
|
| 204 |
-
("Machine learning is fascinating", "I love eating pizza"),
|
| 205 |
-
]
|
| 206 |
-
|
| 207 |
-
for t1, t2 in examples:
|
| 208 |
-
similarity = model.similarity(t1, t2)
|
| 209 |
-
|
| 210 |
-
if similarity > 0.5:
|
| 211 |
-
css_class = "high-score"
|
| 212 |
-
elif similarity > 0.3:
|
| 213 |
-
css_class = "medium-score"
|
| 214 |
-
else:
|
| 215 |
-
css_class = "low-score"
|
| 216 |
-
|
| 217 |
-
st.markdown(f"""
|
| 218 |
-
<div class="result-box {css_class}">
|
| 219 |
-
<strong>{similarity:.3f}</strong> | "{t1}" vs "{t2}"
|
| 220 |
-
</div>
|
| 221 |
-
""", unsafe_allow_html=True)
|
| 222 |
-
|
| 223 |
-
# ============================================================================
|
| 224 |
-
# TAB 2: SEMANTIC SEARCH
|
| 225 |
-
# ============================================================================
|
| 226 |
-
|
| 227 |
-
with tab2:
|
| 228 |
-
st.markdown("### Semantic Document Search")
|
| 229 |
-
st.markdown("""
|
| 230 |
-
Enter a natural-language query. The model encodes your query and all
|
| 231 |
-
documents into the same vector space, then ranks documents by cosine
|
| 232 |
-
similarity. This finds **meaning-based** matches, not just keyword overlap.
|
| 233 |
-
""")
|
| 234 |
-
|
| 235 |
-
default_docs = """Python is a high-level programming language
|
| 236 |
-
Machine learning algorithms learn patterns from data
|
| 237 |
-
The weather today is sunny and warm
|
| 238 |
-
Neural networks are inspired by the human brain
|
| 239 |
-
JavaScript is used for web development
|
| 240 |
-
Deep learning has transformed computer vision
|
| 241 |
-
Cats are popular pets around the world
|
| 242 |
-
TensorFlow and PyTorch are ML frameworks
|
| 243 |
-
The stock market had a volatile day
|
| 244 |
-
Natural language processing understands text"""
|
| 245 |
-
|
| 246 |
-
query = st.text_input(
|
| 247 |
-
"Search Query",
|
| 248 |
-
"How do AI systems learn from examples?",
|
| 249 |
-
key="search_query"
|
| 250 |
-
)
|
| 251 |
-
|
| 252 |
-
documents_text = st.text_area(
|
| 253 |
-
"Documents (one per line)",
|
| 254 |
-
default_docs,
|
| 255 |
-
height=200,
|
| 256 |
-
key="search_docs"
|
| 257 |
-
)
|
| 258 |
-
|
| 259 |
-
top_k = st.slider("Number of results", 1, 10, 5, key="search_topk")
|
| 260 |
-
|
| 261 |
-
if st.button("Search", type="primary", key="search_btn"):
|
| 262 |
-
documents = [d.strip() for d in documents_text.split('\n') if d.strip()]
|
| 263 |
-
|
| 264 |
-
if query and documents:
|
| 265 |
-
with st.spinner("Searching..."):
|
| 266 |
-
results = model.search(query, documents, top_k=top_k)
|
| 267 |
-
|
| 268 |
-
st.markdown("### Results")
|
| 269 |
-
st.markdown("Documents ranked by semantic relevance to your query:")
|
| 270 |
-
|
| 271 |
-
for r in results:
|
| 272 |
-
score = r['score']
|
| 273 |
-
if score > 0.6:
|
| 274 |
-
indicator = "[HIGH]"
|
| 275 |
-
css_class = "high-score"
|
| 276 |
-
elif score > 0.4:
|
| 277 |
-
indicator = "[MED]"
|
| 278 |
-
css_class = "medium-score"
|
| 279 |
-
else:
|
| 280 |
-
indicator = "[LOW]"
|
| 281 |
-
css_class = "low-score"
|
| 282 |
-
|
| 283 |
-
st.markdown(f"""
|
| 284 |
-
<div class="result-box {css_class}">
|
| 285 |
-
<strong>{indicator} #{r['rank']}</strong> (score: {score:.4f})<br>
|
| 286 |
-
{r['text']}
|
| 287 |
-
</div>
|
| 288 |
-
""", unsafe_allow_html=True)
|
| 289 |
-
|
| 290 |
-
# ============================================================================
|
| 291 |
-
# TAB 3: CLUSTERING
|
| 292 |
-
# ============================================================================
|
| 293 |
-
|
| 294 |
-
with tab3:
|
| 295 |
-
st.markdown("### Unsupervised Text Clustering")
|
| 296 |
-
st.markdown("""
|
| 297 |
-
The model encodes each text into a dense vector. K-Means clustering
|
| 298 |
-
then groups these vectors by proximity in the embedding space.
|
| 299 |
-
Texts that are semantically similar end up in the same cluster,
|
| 300 |
-
even if they share no common words.
|
| 301 |
-
""")
|
| 302 |
-
|
| 303 |
-
default_cluster_texts = """Python programming language
|
| 304 |
-
Machine learning algorithms
|
| 305 |
-
Deep learning neural networks
|
| 306 |
-
JavaScript web development
|
| 307 |
-
Cats and dogs as pets
|
| 308 |
-
Pizza and pasta Italian food
|
| 309 |
-
Sunny weather today
|
| 310 |
-
Rainy day forecast
|
| 311 |
-
Stock market trends
|
| 312 |
-
Financial news update"""
|
| 313 |
-
|
| 314 |
-
cluster_texts = st.text_area(
|
| 315 |
-
"Texts to cluster (one per line)",
|
| 316 |
-
default_cluster_texts,
|
| 317 |
-
height=200,
|
| 318 |
-
key="cluster_texts"
|
| 319 |
-
)
|
| 320 |
-
|
| 321 |
-
n_clusters = st.slider("Number of clusters", 2, 10, 3, key="n_clusters")
|
| 322 |
-
|
| 323 |
-
if st.button("Run Clustering", type="primary", key="cluster_btn"):
|
| 324 |
-
texts = [t.strip() for t in cluster_texts.split('\n') if t.strip()]
|
| 325 |
-
|
| 326 |
-
if len(texts) >= n_clusters:
|
| 327 |
-
with st.spinner("Clustering..."):
|
| 328 |
-
result = model.cluster_texts(texts, n_clusters=n_clusters)
|
| 329 |
-
|
| 330 |
-
st.markdown("### Cluster Assignments")
|
| 331 |
-
st.markdown("Each group contains texts that the model considers semantically related:")
|
| 332 |
-
|
| 333 |
-
colors = ["#667eea", "#28a745", "#ffc107", "#dc3545", "#17a2b8",
|
| 334 |
-
"#6f42c1", "#fd7e14", "#20c997", "#e83e8c", "#6c757d"]
|
| 335 |
-
|
| 336 |
-
for cluster_id in sorted(result['texts_by_cluster'].keys()):
|
| 337 |
-
cluster_texts_list = result['texts_by_cluster'][cluster_id]
|
| 338 |
-
color = colors[cluster_id % len(colors)]
|
| 339 |
-
|
| 340 |
-
st.markdown(f"""
|
| 341 |
-
<div style="background: {color}15; border-left: 4px solid {color};
|
| 342 |
-
padding: 1rem; border-radius: 5px; margin: 0.5rem 0;">
|
| 343 |
-
<strong style="color: {color};">Cluster {cluster_id + 1}</strong>
|
| 344 |
-
({len(cluster_texts_list)} texts)
|
| 345 |
-
</div>
|
| 346 |
-
""", unsafe_allow_html=True)
|
| 347 |
-
|
| 348 |
-
for text in cluster_texts_list:
|
| 349 |
-
st.markdown(f" - {text}")
|
| 350 |
-
else:
|
| 351 |
-
st.warning(f"Need at least {n_clusters} texts to create {n_clusters} clusters.")
|
| 352 |
-
|
| 353 |
-
# ============================================================================
|
| 354 |
-
# TAB 4: ENCODE TEXT
|
| 355 |
-
# ============================================================================
|
| 356 |
-
|
| 357 |
-
with tab4:
|
| 358 |
-
st.markdown("### Raw Embedding Inspector")
|
| 359 |
-
st.markdown("""
|
| 360 |
-
Convert any text into its dense vector representation. The output is a
|
| 361 |
-
256-dimensional float vector that is **L2-normalized** (unit length = 1.0).
|
| 362 |
-
This is the same representation used internally for similarity and search.
|
| 363 |
-
""")
|
| 364 |
-
|
| 365 |
-
encode_text = st.text_area(
|
| 366 |
-
"Text to encode",
|
| 367 |
-
"Machine learning is a fascinating field of study.",
|
| 368 |
-
height=100,
|
| 369 |
-
key="encode_text"
|
| 370 |
-
)
|
| 371 |
-
|
| 372 |
-
if st.button("Encode", type="primary", key="encode_btn"):
|
| 373 |
-
if encode_text:
|
| 374 |
-
with st.spinner("Encoding..."):
|
| 375 |
-
embedding = model.encode(encode_text)
|
| 376 |
-
|
| 377 |
-
st.markdown("### Embedding Vector")
|
| 378 |
-
|
| 379 |
-
col1, col2, col3 = st.columns(3)
|
| 380 |
-
with col1:
|
| 381 |
-
st.metric("Dimensions", embedding.shape[1])
|
| 382 |
-
with col2:
|
| 383 |
-
st.metric("L2 Norm", f"{np.linalg.norm(embedding[0]):.4f}")
|
| 384 |
-
with col3:
|
| 385 |
-
st.metric("Mean Value", f"{embedding[0].mean():.4f}")
|
| 386 |
-
|
| 387 |
-
st.markdown("#### First 20 values:")
|
| 388 |
-
st.code(str(embedding[0][:20].round(4).tolist()))
|
| 389 |
-
|
| 390 |
-
st.markdown("#### Value Distribution")
|
| 391 |
-
st.markdown("A well-trained model produces a roughly Gaussian distribution centered near zero:")
|
| 392 |
-
import plotly.express as px
|
| 393 |
-
fig = px.histogram(
|
| 394 |
-
x=embedding[0],
|
| 395 |
-
nbins=50,
|
| 396 |
-
title="Embedding Value Distribution",
|
| 397 |
-
labels={'x': 'Value', 'y': 'Count'}
|
| 398 |
-
)
|
| 399 |
-
fig.update_layout(showlegend=False)
|
| 400 |
-
st.plotly_chart(fig, width="stretch")
|
| 401 |
-
|
| 402 |
-
# ============================================================================
|
| 403 |
-
# TAB 5: CSV MATCHER
|
| 404 |
-
# ============================================================================
|
| 405 |
-
|
| 406 |
-
with tab5:
|
| 407 |
-
st.markdown("### Bulk CSV Record Matcher")
|
| 408 |
-
st.markdown("""
|
| 409 |
-
Upload two CSV files and match rows across them using semantic similarity.
|
| 410 |
-
This is useful for:
|
| 411 |
-
- **Product deduplication** across e-commerce platforms
|
| 412 |
-
- **Record linkage** between databases with inconsistent naming
|
| 413 |
-
- **Cross-platform mapping** (e.g., matching supplier catalogs to your inventory)
|
| 414 |
-
|
| 415 |
-
The model encodes the selected text column from each CSV, then ranks
|
| 416 |
-
every row in CSV 2 against each row in CSV 1 by cosine similarity.
|
| 417 |
-
""")
|
| 418 |
-
|
| 419 |
-
col1, col2 = st.columns(2)
|
| 420 |
-
|
| 421 |
-
with col1:
|
| 422 |
-
st.markdown("#### Upload CSV 1 (Queries)")
|
| 423 |
-
file1 = st.file_uploader("Upload primary CSV", type=['csv'], key="csv_file_1")
|
| 424 |
-
|
| 425 |
-
with col2:
|
| 426 |
-
st.markdown("#### Upload CSV 2 (Knowledge Base)")
|
| 427 |
-
file2 = st.file_uploader("Upload secondary CSV", type=['csv'], key="csv_file_2")
|
| 428 |
-
|
| 429 |
-
if file1 and file2:
|
| 430 |
-
df1 = pd.read_csv(file1)
|
| 431 |
-
df2 = pd.read_csv(file2)
|
| 432 |
-
|
| 433 |
-
st.markdown("---")
|
| 434 |
-
col_m1, col_m2 = st.columns(2)
|
| 435 |
-
|
| 436 |
-
with col_m1:
|
| 437 |
-
col1_name = st.selectbox("Select column to match from CSV 1", df1.columns, key="col1_sel")
|
| 438 |
-
|
| 439 |
-
with col_m2:
|
| 440 |
-
col2_name = st.selectbox("Select column to search in CSV 2", df2.columns, key="col2_sel")
|
| 441 |
-
|
| 442 |
-
col_p1, col_p2 = st.columns(2)
|
| 443 |
-
with col_p1:
|
| 444 |
-
top_n_candidates = st.slider("Step 1: Top candidates to fetch", 1, 50, 10, help="Initial semantic search depth")
|
| 445 |
-
with col_p2:
|
| 446 |
-
top_m_final = st.slider("Step 2: Top matches to keep", 1, 10, 3, help="Final number of matches per row")
|
| 447 |
-
|
| 448 |
-
if st.button("Start Bulk Matching", type="primary"):
|
| 449 |
-
progress_bar = st.progress(0)
|
| 450 |
-
status_text = st.empty()
|
| 451 |
-
|
| 452 |
-
queries = df1[col1_name].fillna("").astype(str).tolist()
|
| 453 |
-
corpus = df2[col2_name].fillna("").astype(str).tolist()
|
| 454 |
-
|
| 455 |
-
status_text.text("Encoding search corpus (CSV 2)...")
|
| 456 |
-
corpus_embs = model.encode(corpus, batch_size=128)
|
| 457 |
-
progress_bar.progress(20)
|
| 458 |
-
|
| 459 |
-
status_text.text("Encoding queries (CSV 1)...")
|
| 460 |
-
query_embs = model.encode(queries, batch_size=128)
|
| 461 |
-
progress_bar.progress(50)
|
| 462 |
-
|
| 463 |
-
status_text.text("Computing similarities and mapping...")
|
| 464 |
-
similarities = np.dot(query_embs, corpus_embs.T)
|
| 465 |
-
progress_bar.progress(80)
|
| 466 |
-
|
| 467 |
-
all_results = []
|
| 468 |
-
for i in range(len(queries)):
|
| 469 |
-
row_scores = similarities[i]
|
| 470 |
-
top_indices = np.argsort(row_scores)[::-1][:top_m_final]
|
| 471 |
-
|
| 472 |
-
res_row = df1.iloc[i].to_dict()
|
| 473 |
-
for rank, idx in enumerate(top_indices, 1):
|
| 474 |
-
res_row[f'Match_{rank}_{col2_name}'] = corpus[idx]
|
| 475 |
-
res_row[f'Match_{rank}_Score'] = round(float(row_scores[idx]), 4)
|
| 476 |
-
all_results.append(res_row)
|
| 477 |
-
|
| 478 |
-
res_df = pd.DataFrame(all_results)
|
| 479 |
-
|
| 480 |
-
progress_bar.progress(100)
|
| 481 |
-
status_text.text("Matching complete.")
|
| 482 |
-
|
| 483 |
-
st.markdown("### Results Preview")
|
| 484 |
-
st.dataframe(res_df.head(50), width="stretch")
|
| 485 |
-
|
| 486 |
-
output = io.StringIO()
|
| 487 |
-
res_df.to_csv(output, index=False)
|
| 488 |
-
csv_string = output.getvalue()
|
| 489 |
-
|
| 490 |
-
st.download_button(
|
| 491 |
-
label="Download Full Results CSV",
|
| 492 |
-
data=csv_string,
|
| 493 |
-
file_name="semantic_matching_results.csv",
|
| 494 |
-
mime="text/csv",
|
| 495 |
-
)
|
| 496 |
-
else:
|
| 497 |
-
st.info("Upload both CSV files to begin matching.")
|
| 498 |
-
|
| 499 |
-
|
| 500 |
-
# ============================================================================
|
| 501 |
-
# FOOTER
|
| 502 |
-
# ============================================================================
|
| 503 |
-
|
| 504 |
-
st.markdown("---")
|
| 505 |
-
st.markdown("""
|
| 506 |
-
<div style="text-align: center; color: #666; padding: 1rem;">
|
| 507 |
-
<strong>MiniEmbed</strong> | Lightweight Text Embeddings |
|
| 508 |
-
<a href="https://github.com/bhandarisuraz/miniembed">GitHub</a>
|
| 509 |
-
</div>
|
| 510 |
-
""", unsafe_allow_html=True)
examples/basic_usage.py DELETED

```python
"""
Basic Usage Example
===================
Demonstrates encoding texts and computing similarity using MiniEmbed.

This script shows the three core operations:
1. Encoding raw text into dense vectors
2. Computing pairwise similarity between two texts
3. Building a full similarity matrix across sets of texts
"""

import sys
sys.path.insert(0, '..')

from src.inference import EmbeddingInference


def main():
    print("=" * 60)
    print("MiniEmbed - Basic Usage Example")
    print("=" * 60)

    # Load the model
    print("\nLoading model...")
    model = EmbeddingInference.from_pretrained("../models/mini")
    print("Model loaded.\n")

    # -------------------------------------------------------------------------
    # Example 1: Encode texts
    # -------------------------------------------------------------------------
    print("-" * 40)
    print("Example 1: Encoding Texts")
    print("-" * 40)

    texts = [
        "Machine learning is a branch of artificial intelligence",
        "Deep learning uses neural networks with many layers",
        "I love eating pizza on weekends"
    ]

    embeddings = model.encode(texts)
    print(f"Input: {len(texts)} texts")
    print(f"Output: {embeddings.shape}")  # (3, 256)

    # -------------------------------------------------------------------------
    # Example 2: Compute similarity
    # -------------------------------------------------------------------------
    print("\n" + "-" * 40)
    print("Example 2: Computing Similarity")
    print("-" * 40)

    pairs = [
        ("Machine learning is great", "AI is wonderful"),
        ("Machine learning is great", "I love pizza"),
        ("The cat sat on the mat", "A feline rested on the rug"),
    ]

    for text1, text2 in pairs:
        similarity = model.similarity(text1, text2)
        tag = "MATCH" if similarity > 0.5 else " LOW"
        print(f"  [{tag}] {similarity:.4f} | '{text1}' vs '{text2}'")

    # -------------------------------------------------------------------------
    # Example 3: Pairwise similarity matrix
    # -------------------------------------------------------------------------
    print("\n" + "-" * 40)
    print("Example 3: Pairwise Similarity Matrix")
    print("-" * 40)

    texts_a = ["Machine learning", "Deep learning", "Natural language"]
    texts_b = ["AI models", "Neural networks", "Text processing"]

    similarity_matrix = model.pairwise_similarity(texts_a, texts_b)

    print("\nSimilarity Matrix:")
    print("            ", "  ".join(f"{t[:10]:>10}" for t in texts_b))
    for i, text in enumerate(texts_a):
        row = "  ".join(f"{similarity_matrix[i, j]:>10.4f}" for j in range(len(texts_b)))
        print(f"{text[:12]:>12}: {row}")

    print("\nDone.")


if __name__ == "__main__":
    main()
```
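The similarity matrix in Example 3 can be sketched with plain NumPy, assuming embeddings are row vectors (one row per text). The `pairwise_similarity` below is a standalone stand-in for the model method, not the repo's implementation:

```python
import numpy as np

def pairwise_similarity(embs_a, embs_b):
    """Cosine similarity matrix between two sets of row-vector embeddings."""
    a = np.asarray(embs_a, dtype=float)
    b = np.asarray(embs_b, dtype=float)
    # Normalize every row to unit length; then one matmul scores all pairs.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Two texts on one side, one on the other -> a (2, 1) score matrix.
m = pairwise_similarity([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]])
print(m.shape)  # (2, 1)
```

A single matrix multiply of normalized embeddings computes all pairwise cosine similarities at once, which is why batch search scales well.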
examples/clustering.py DELETED

```python
"""
Text Clustering Example
=======================
Demonstrates how to cluster texts by semantic similarity using MiniEmbed.

The model encodes each text into a dense vector. K-Means clustering then
groups these vectors by proximity in the embedding space, even if the texts
share no common words.
"""

import sys
sys.path.insert(0, '..')

from src.inference import EmbeddingInference


def main():
    print("=" * 60)
    print("MiniEmbed - Text Clustering Example")
    print("=" * 60)

    # Load the model
    print("\nLoading model...")
    model = EmbeddingInference.from_pretrained("../models/mini")
    print("Model loaded.\n")

    # -------------------------------------------------------------------------
    # Text collection (mixed topics)
    # -------------------------------------------------------------------------
    texts = [
        # Technology
        "Python is a versatile programming language",
        "Machine learning models learn from data",
        "JavaScript is used for web development",
        "Neural networks process information like the brain",
        "Software engineering involves designing systems",

        # Food
        "Pizza is my favorite Italian dish",
        "Sushi is a traditional Japanese cuisine",
        "Tacos are delicious Mexican street food",
        "Pasta with marinara sauce is comforting",
        "Ramen noodles are popular in Japan",

        # Sports
        "Football is the most popular sport worldwide",
        "Basketball requires teamwork and skill",
        "Tennis is an exciting individual sport",
        "Swimming is great for cardiovascular health",
        "Soccer World Cup attracts billions of viewers",

        # Nature
        "Mountains offer breathtaking scenic views",
        "Oceans cover most of the Earth's surface",
        "Forests are home to diverse wildlife",
        "Rivers provide fresh water to ecosystems",
        "Deserts have extreme temperature variations",
    ]

    print(f"Text Collection: {len(texts)} texts (4 topics)")

    # -------------------------------------------------------------------------
    # Cluster texts
    # -------------------------------------------------------------------------
    print("\nClustering texts into 4 groups...")

    result = model.cluster_texts(texts, n_clusters=4)

    # -------------------------------------------------------------------------
    # Display results
    # -------------------------------------------------------------------------
    print("\n" + "=" * 60)
    print("Clustering Results")
    print("=" * 60)

    for cluster_id in sorted(result['texts_by_cluster'].keys()):
        cluster_texts = result['texts_by_cluster'][cluster_id]

        print(f"\n Cluster {cluster_id + 1} ({len(cluster_texts)} texts)")
        print("-" * 40)

        for text in cluster_texts:
            print(f"  - {text}")

    # -------------------------------------------------------------------------
    # Evaluate clustering (simple check)
    # -------------------------------------------------------------------------
    print("\n" + "=" * 60)
    print("Clustering Analysis")
    print("=" * 60)

    # Expected groupings (approximate)
    expected = {
        "Technology": texts[0:5],
        "Food": texts[5:10],
        "Sports": texts[10:15],
        "Nature": texts[15:20],
    }

    print("\nLabels assigned to each text:")
    for i, (text, label) in enumerate(zip(texts, result['labels'])):
        topic = list(expected.keys())[i // 5]
        print(f"  [{label}] ({topic}) {text[:50]}...")

    print("\nDone.")


if __name__ == "__main__":
    main()
```
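`cluster_texts` above groups embeddings with K-Means. A minimal dependency-free sketch of the same idea (Lloyd's algorithm: assign each vector to its nearest centroid, recompute centroids, repeat) is shown below; the helper name and the toy 2-D points standing in for real embeddings are illustrative, not the repo's API:

```python
import numpy as np

def kmeans_cluster(embeddings, texts, n_clusters, n_iter=20, seed=0):
    """Group texts by K-Means over their embedding vectors.

    Returns (labels, {cluster_id: [texts]}) mirroring the
    'texts_by_cluster' shape used in the example above.
    """
    X = np.asarray(embeddings, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize centroids from distinct random points.
    centroids = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid; pick the nearest.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        for k in range(n_clusters):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    by_cluster = {}
    for text, label in zip(texts, labels):
        by_cluster.setdefault(int(label), []).append(text)
    return labels, by_cluster

# Two tight groups of 2-D points stand in for real 256-dim embeddings.
embs = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
labels, groups = kmeans_cluster(embs, ["a", "b", "c", "d"], 2)
print(sorted(len(g) for g in groups.values()))  # [2, 2]
```

In practice the repo's optional `scikit-learn` dependency (`sklearn.cluster.KMeans`) does this with better initialization and convergence checks.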
examples/semantic_search.py DELETED

```python
"""
Semantic Search Example
=======================
Demonstrates how to use MiniEmbed for document retrieval.

The model encodes a query and a corpus of documents into the same vector space,
then ranks documents by cosine similarity to the query. This finds results based
on meaning, not keyword overlap.
"""

import sys
sys.path.insert(0, '..')

from src.inference import EmbeddingInference


def main():
    print("=" * 60)
    print("MiniEmbed - Semantic Search Example")
    print("=" * 60)

    # Load the model
    print("\nLoading model...")
    model = EmbeddingInference.from_pretrained("../models/mini")
    print("Model loaded.\n")

    # -------------------------------------------------------------------------
    # Document collection
    # -------------------------------------------------------------------------
    documents = [
        "Python is a high-level programming language known for its simplicity",
        "Machine learning algorithms can learn patterns from data",
        "The weather today is sunny with a high of 75 degrees",
        "Neural networks are computational models inspired by the brain",
        "JavaScript is widely used for web development",
        "Deep learning has revolutionized computer vision and NLP",
        "Cats are popular pets known for their independence",
        "TensorFlow and PyTorch are popular deep learning frameworks",
        "The stock market showed strong gains today",
        "Natural language processing helps computers understand text"
    ]

    print(f"Document Collection: {len(documents)} documents")
    for i, doc in enumerate(documents, 1):
        print(f"  {i}. {doc[:60]}...")

    # -------------------------------------------------------------------------
    # Search queries
    # -------------------------------------------------------------------------
    queries = [
        "How do AI systems learn from examples?",
        "What programming language is good for beginners?",
        "Tell me about artificial neural networks",
    ]

    print("\n" + "=" * 60)
    print("Search Results")
    print("=" * 60)

    for query in queries:
        print(f"\n Query: \"{query}\"")
        print("-" * 50)

        results = model.search(query, documents, top_k=3)

        for r in results:
            score = r['score']
            if score > 0.6:
                tag = "[HIGH]"
            elif score > 0.4:
                tag = "[ MED]"
            else:
                tag = "[ LOW]"

            print(f"  {tag} #{r['rank']} (score: {score:.4f})")
            print(f"       {r['text']}")

    # -------------------------------------------------------------------------
    # Interactive search (optional)
    # -------------------------------------------------------------------------
    print("\n" + "=" * 60)
    print("Interactive Search")
    print("=" * 60)
    print("Enter your own queries (type 'quit' to exit):\n")

    while True:
        try:
            query = input(" Query: ").strip()
            if query.lower() in ['quit', 'exit', 'q']:
                break
            if not query:
                continue

            results = model.search(query, documents, top_k=3)

            print("\n Results:")
            for r in results:
                print(f"  - [{r['score']:.3f}] {r['text'][:70]}...")
            print()

        except (KeyboardInterrupt, EOFError):
            break

    print("\nDone.")


if __name__ == "__main__":
    main()
```
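The ranking step behind `model.search` can be sketched with plain NumPy, assuming row-vector embeddings for the query and documents. The `search` helper below is a hypothetical standalone version producing the same `rank`/`score`/`text` result dicts as the example, not the repo's implementation:

```python
import numpy as np

def search(query_emb, doc_embs, docs, top_k=3):
    """Rank documents by cosine similarity to a query embedding."""
    q = np.asarray(query_emb, dtype=float)
    d = np.asarray(doc_embs, dtype=float)
    # Normalize so the dot product equals cosine similarity.
    q = q / np.linalg.norm(q)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    scores = d @ q
    # Sort descending and keep the top_k indices.
    order = np.argsort(scores)[::-1][:top_k]
    return [{"rank": r, "score": float(scores[i]), "text": docs[i]}
            for r, i in enumerate(order, 1)]

# Toy 2-D embeddings: 'alpha' aligns exactly with the query direction.
docs = ["alpha", "beta", "gamma"]
embs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
results = search([1.0, 0.0], embs, docs, top_k=2)
print([r["text"] for r in results])  # ['alpha', 'beta']
```

For small corpora the full similarity vector is cheap; at scale the same dot-product ranking is what approximate nearest-neighbor indexes accelerate.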
models/mini/model.pt → model.pt
RENAMED
|
File without changes
|
models/mini/model.safetensors → model.safetensors
RENAMED
|
File without changes
|
models/large/README.md
DELETED
|
@@ -1,5 +0,0 @@
|
|
| 1 | - # MiniEmbed - Large
| 2 | -
| 3 | - Full-scale variant for maximum accuracy on complex semantic tasks.
| 4 | -
| 5 | - Coming soon...
models/medium/README.md
DELETED
|
@@ -1,5 +0,0 @@
|
|
| 1 | - # MiniEmbed - Medium
| 2 | -
| 3 | - Balanced variant offering higher accuracy with moderate compute requirements.
| 4 | -
| 5 | - Coming soon...
models/product/README.md
DELETED
|
@@ -1,5 +0,0 @@
|
|
| 1 | - # MiniEmbed - Product
| 2 | -
| 3 | - Fine-tuned variant of Mini, specialized for high-accuracy product matching.
| 4 | -
| 5 | - Coming soon...
models/small/README.md
DELETED
|
@@ -1,5 +0,0 @@
|
|
| 1 | - # MiniEmbed - Small
| 2 | -
| 3 | - A larger variant with increased capacity for general-purpose embeddings.
| 4 | -
| 5 | - Coming soon...
requirements.txt
DELETED
|
@@ -1,14 +0,0 @@
|
|
| 1 | - # Core
| 2 | - torch>=2.0.0
| 3 | - numpy>=1.21.0
| 4 | - tqdm>=4.64.0
| 5 | -
| 6 | - # Demo UI
| 7 | - streamlit>=1.30.0
| 8 | - plotly>=5.0.0
| 9 | -
| 10 | - # Optional (for clustering, CSV processing, & Benchmarking)
| 11 | - scikit-learn>=1.0.0
| 12 | - pandas>=2.0.0
| 13 | - psutil>=5.9.0
| 14 | - sentence-transformers>=2.2.0
src/__pycache__/__init__.cpython-313.pyc
ADDED
|
Binary file (500 Bytes)
|
src/__pycache__/inference.cpython-313.pyc
ADDED
|
Binary file (14.7 kB)
|
src/__pycache__/model.cpython-313.pyc
ADDED
|
Binary file (15 kB)
|
src/__pycache__/tokenizer.cpython-313.pyc
ADDED
|
Binary file (7.06 kB)
|
models/mini/tokenizer.json → tokenizer.json
RENAMED
|
File without changes
|
models/mini/training_info.json → training_info.json
RENAMED
|
File without changes
|