surazbhandari committed
Commit b9e9219 · verified · 1 Parent(s): 3403783

Sync from GitHub Actions

Files changed (3):
  1. .gitattributes  +2 -0
  2. MODEL_CARD.md   +173 -0
  3. README.md       +7 -3
.gitattributes ADDED
@@ -0,0 +1,2 @@
+model.pt filter=lfs diff=lfs merge=lfs -text
+model.safetensors filter=lfs diff=lfs merge=lfs -text
MODEL_CARD.md ADDED
@@ -0,0 +1,173 @@
---
language: en
license: mit
tags:
- text-embedding
- sentence-similarity
- semantic-search
- product-matching
- transformer
- pytorch
- from-scratch
library_name: pytorch
pipeline_tag: sentence-similarity
model-index:
- name: MiniEmbed-Mini
  results: []
---

# MiniEmbed: Tiny, Powerful Embedding Models from Scratch

**MiniEmbed** is an ultra-compact text embedding model (Bi-Encoder) built entirely from scratch in PyTorch. No HuggingFace Transformers, no pre-trained weights -- just pure PyTorch.

**GitHub:** [github.com/bhandarisuraz/miniembed](https://github.com/bhandarisuraz/miniembed) (full repo with examples, tests, interactive demo, and documentation)

| Spec | Value |
|---|---|
| Parameters | ~10.8M |
| Model Size | ~42 MB |
| Embedding Dim | 256 |
| Vocab Size | 30,000 |
| Max Seq Length | 128 tokens |
| Architecture | 4-layer Transformer Encoder |
| Pooling | Mean Pooling + L2 Normalization |
| Training Loss | MNRL (Multiple Negatives Ranking Loss) |
| Training Data | ~3.8M pairs (NQ, GooAQ, MSMARCO, WDC, ECInstruct) |

## Quick Start

```bash
pip install torch numpy scikit-learn huggingface_hub
```

```python
from huggingface_hub import snapshot_download

# Download model (one-time)
model_dir = snapshot_download("surazbhandari/miniembed")

# Add src to path
import sys
sys.path.insert(0, model_dir)

from src.inference import EmbeddingInference

# Load -- just like sentence-transformers!
model = EmbeddingInference.from_pretrained("surazbhandari/miniembed")

# 1. Similarity
score = model.similarity("Machine learning is great", "AI is wonderful")
print(f"Similarity: {score:.4f}")  # 0.4287

# 2. Normal Embeddings
embeddings = model.encode(["Machine learning is great", "AI is wonderful"])

# 3. Manual Cosine Similarity
# Since embeddings are L2-normalized, dot product is cosine similarity
import numpy as np
score = np.dot(embeddings[0], embeddings[1])
print(f"Similarity: {score:.4f}")

# Semantic Search
docs = ["Python is great for AI", "I love pizza", "Neural networks learn patterns"]
results = model.search("deep learning frameworks", docs, top_k=2)
for r in results:
    print(f"  [{r['score']:.3f}] {r['text']}")
# [0.498] Neural networks learn patterns
# [0.413] Python is great for AI

# Clustering
result = model.cluster_texts(["ML is cool", "Pizza is food", "AI rocks"], n_clusters=2)
# Cluster 1: ['Pizza is food']
# Cluster 2: ['ML is cool', 'AI rocks']
```
## Also Available via GitHub

```bash
git clone https://github.com/bhandarisuraz/miniembed.git
cd miniembed
pip install -r requirements.txt

python -c "
from src.inference import EmbeddingInference
model = EmbeddingInference.from_pretrained('models/mini')
print(model.similarity('hello world', 'hi there'))
"
```

## Capabilities

- **Semantic Search** -- Find meaning-based matches, not keyword overlap.
- **Re-Ranking** -- Sort candidates by true semantic relevance.
- **Clustering** -- Group texts into logical categories automatically.
- **Product Matching** -- Match items across platforms with messy titles.
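Because the embeddings are L2-normalized, semantic search reduces to a matrix product followed by a sort. The sketch below illustrates that mechanic with random stand-in vectors (shapes only; these are not the model's real embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each vector to unit length so dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

doc_embs = l2_normalize(rng.normal(size=(100, 256)))   # stand-in corpus embeddings
query = l2_normalize(rng.normal(size=(256,)))          # stand-in query embedding

scores = doc_embs @ query                              # cosine similarity to every doc
top_k = np.argsort(-scores)[:5]                        # indices of the 5 best matches
```

`model.search` wraps this pattern behind a convenience API; re-ranking is the same computation applied to a candidate list instead of a whole corpus.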
## Architecture

Custom 4-layer Transformer encoder built from first principles:

- Token Embedding (30K vocab) + Sinusoidal Positional Encoding
- 4x Pre-LayerNorm Transformer Encoder Layers
  - Multi-Head Self-Attention (4 heads, d_k=64)
  - Position-wise Feed-Forward (GELU activation, d_ff=1024)
- Mean Pooling over non-padded tokens
- L2 Normalization (unit hypersphere projection)
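The pooling head (the last two steps above) can be sketched as follows -- illustrative code, not the repo's exact implementation:

```python
import torch

def pool_and_normalize(token_embs: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool over non-padded tokens, then L2-normalize to the unit hypersphere."""
    mask = attention_mask.unsqueeze(-1).float()              # (batch, seq, 1); 1 = real token
    summed = (token_embs * mask).sum(dim=1)                  # sum embeddings of real tokens only
    counts = mask.sum(dim=1).clamp(min=1e-9)                 # number of real tokens per example
    mean = summed / counts                                   # mean pooling
    return torch.nn.functional.normalize(mean, p=2, dim=1)   # L2 normalization

torch.manual_seed(0)
x = torch.randn(2, 5, 256)                                   # (batch, seq_len, dim)
m = torch.tensor([[1, 1, 1, 0, 0],                           # first example has 2 padding tokens
                  [1, 1, 1, 1, 1]])
emb = pool_and_normalize(x, m)
print(emb.shape)                                             # torch.Size([2, 256])
```

Masking before the sum matters: averaging over padding positions would dilute short inputs with zero (or garbage) vectors.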
## Training

Trained on ~3.8 million text pairs from public datasets:

| Dataset | Type |
|---|---|
| Natural Questions (NQ) | Q&A / General |
| GooAQ | Knowledge Search |
| WDC Product Matching | E-commerce |
| ECInstruct | E-commerce Tasks |
| MS MARCO | Web Search |

**Training details:**

- Training time: ~49 hours
- Final loss: 0.0748
- Optimizer: AdamW
- Batch size: 256
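MNRL treats every other document in the batch as a negative: it builds a query-document similarity matrix and applies cross-entropy with the diagonal as the targets. A minimal sketch follows; the `scale` of 20.0 is an assumed value borrowed from common bi-encoder setups, not a hyperparameter stated in this card:

```python
import torch
import torch.nn.functional as F

def mnrl_loss(query_embs: torch.Tensor, doc_embs: torch.Tensor, scale: float = 20.0) -> torch.Tensor:
    """Multiple Negatives Ranking Loss with in-batch negatives.

    Inputs are (batch, dim) and L2-normalized, so the matmul yields cosine similarities.
    The doc at index i is query i's positive; all other docs in the batch are negatives.
    """
    scores = query_embs @ doc_embs.T * scale        # (batch, batch) similarity matrix
    labels = torch.arange(scores.size(0))           # diagonal entries are the true pairs
    return F.cross_entropy(scores, labels)          # rank each query's positive highest

torch.manual_seed(0)
q = F.normalize(torch.randn(8, 256), dim=1)         # toy query embeddings
d = F.normalize(torch.randn(8, 256), dim=1)         # toy document embeddings
loss = mnrl_loss(q, d)
```

This is why large batch sizes (256 here) help: every extra example in the batch is a free negative for every query.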
## Files

```
surazbhandari/miniembed
|-- README.md            # This model card
|-- config.json          # Architecture config
|-- model.safetensors    # Pre-trained weights (safe & fast)
|-- model.pt             # Pre-trained weights (legacy PyTorch)
|-- tokenizer.json       # 30K word-level vocabulary
|-- training_info.json   # Training metadata
|-- src/
    |-- __init__.py
    |-- model.py         # Full architecture code
    |-- tokenizer.py     # Tokenizer implementation
    |-- inference.py     # High-level API (supports HF auto-download)
```

## Limitations

- Word-level tokenizer (no subword/BPE) -- unknown words map to [UNK]
- 128 token max sequence length
- Trained primarily on English text
- Best suited for short-form text (queries, product titles, sentences)
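To make the first limitation concrete, here is a toy word-level lookup; the vocabulary and token IDs below are made up for illustration (the real 30K mapping lives in `tokenizer.json`):

```python
# Hypothetical vocabulary fragment -- not the model's actual token IDs.
vocab = {"[PAD]": 0, "[UNK]": 1, "machine": 2, "learning": 3, "is": 4, "great": 5}

def tokenize(text: str) -> list[int]:
    """Lowercase, split on whitespace, and map out-of-vocabulary words to [UNK]."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]

print(tokenize("Machine learning is great"))    # [2, 3, 4, 5]
print(tokenize("Machine learning is stellar"))  # [2, 3, 4, 1] -- "stellar" becomes [UNK]
```

Unlike a subword/BPE tokenizer, there is no fallback decomposition: any word outside the vocabulary contributes no distinguishing signal to the embedding.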
## Citation

```bibtex
@software{Bhandari_MiniEmbed_2026,
  author = {Bhandari, Suraj},
  title = {{MiniEmbed: Tiny, Powerful Embedding Models from Scratch}},
  url = {https://github.com/bhandarisuraz/miniembed},
  version = {1.0.0},
  year = {2026}
}
```

## License

MIT
README.md CHANGED
@@ -61,10 +61,14 @@ print(f"Similarity: {score:.4f}")  # 0.4287
 
 # 2. Normal Embeddings
 embeddings = model.encode(["Machine learning is great", "AI is wonderful"])
+
+# 3. Manual Cosine Similarity
+# Since embeddings are L2-normalized, dot product is cosine similarity
 import numpy as np
-manual_score = np.dot(embeddings[0], embeddings[1])  # Dot product = Cosine Similarity
+score = np.dot(embeddings[0], embeddings[1])
+print(f"Similarity: {score:.4f}")
 
-# 3. Semantic Search
+# Semantic Search
 docs = ["Python is great for AI", "I love pizza", "Neural networks learn patterns"]
 results = model.search("deep learning frameworks", docs, top_k=2)
 for r in results:
@@ -72,7 +76,7 @@ for r in results:
 # [0.498] Neural networks learn patterns
 # [0.413] Python is great for AI
 
-# 4. Clustering
+# Clustering
 result = model.cluster_texts(["ML is cool", "Pizza is food", "AI rocks"], n_clusters=2)
 # Cluster 1: ['Pizza is food']
 # Cluster 2: ['ML is cool', 'AI rocks']