thebajajra committed on
Commit eb44e24 · verified
1 Parent(s): 39c588e

Update README.md

Files changed (1)
  1. README.md +35 -18
README.md CHANGED
@@ -36,14 +36,14 @@ datasets:
  - thebajajra/Ecom-niverse
  ---
 
- # RexGemma-2048
+ # RexBERT-base
 
- [![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://mit-license.org)
+ [![License: Apache2.0](https://img.shields.io/badge/License-Apache2.0-green.svg)](https://www.apache.org/licenses/LICENSE-2.0)
  [![Models](https://img.shields.io/badge/🤗%20Hugging%20Face-Models-red)](https://huggingface.co/collections/thebajajra/rexbert-68cc4b1b8a272f6beae5ebb8)
  [![Data](https://img.shields.io/badge/🤗%20Training%20Data-Ecomniverse-yellow)](https://huggingface.co/datasets/thebajajra/Ecom-niverse)
- [![GitHub](https://img.shields.io/badge/GitHub-Code-blue)](https://github.com/bajajra/RexGemma)
+ [![GitHub](https://img.shields.io/badge/GitHub-Code-blue)](https://github.com/bajajra/RexBERT)
 
- > **TL;DR**: Gemma3-270M decoder converted into an encoder with a 2048-token sequence length and 100M non-embedding parameters to power product search, attribute extraction, classification, and embeddings use cases. The model has been trained on 350B+ e-commerce-specific tokens.
+ > **TL;DR**: An encoder-only transformer (ModernBERT-style) for **e-commerce** applications, trained in three phases—**Pre-training**, **Context Extension**, and **Decay**—to power product search, attribute extraction, classification, and embeddings use cases. The model has been trained on 2.3T+ tokens along with 350B+ e-commerce-specific tokens.
 
  ---
 
@@ -73,7 +73,7 @@ datasets:
  import torch
  from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM, pipeline
 
- MODEL_ID = "thebajajra/RexGemma-2048"
+ MODEL_ID = "thebajajra/RexBERT-base"
 
  # Tokenizer
  tok = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
@@ -113,13 +113,30 @@ emb = (out.last_hidden_state * inputs.attention_mask.unsqueeze(-1)).sum(dim=1) /
 
  ## Model Description
 
- RexGemma-2048 is an **encoder-only**, 100M-parameter transformer trained with a masked-language-modeling objective and optimized for **e-commerce-related text**.
+ RexBERT-base is an **encoder-only**, 150M-parameter transformer trained with a masked-language-modeling objective and optimized for **e-commerce-related text**. The three-phase training curriculum improves general language understanding, extends context handling, and then **specializes** on a very large corpus of commerce data to capture domain-specific terminology and entity distributions.
 
  ---
 
  ## Training Recipe
 
+ RexBERT-base was trained in **three phases**:
+
+ 1) **Pre-training**
+ General-purpose MLM pre-training on diverse English text for robust linguistic representations.
+
+ 2) **Context Extension**
+ Continued training with an **increased max sequence length** to better handle long product pages, concatenated attribute blocks, multi-turn queries, and facet strings. This preserves prior capabilities while expanding context handling.
+
+ 3) **Decay on 350B+ e-commerce tokens**
+ A final specialization stage on **350B+ domain-specific tokens** (product catalogs, queries, reviews, taxonomy/attributes). The learning rate and sampling weights are annealed (decayed) to consolidate domain knowledge and stabilize performance on commerce tasks.
+
+ **Training details (fill in):**
+ - Optimizer / LR schedule: TODO
+ - Effective batch size / steps per phase: TODO
+ - Context lengths per phase (e.g., 512 → 1k/2k): TODO
+ - Tokenizer/vocab: TODO
+ - Hardware & wall-clock: TODO
+ - Checkpoint tags: TODO (e.g., `pretrain`, `ext`, `decay`)
 
  ---
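The **Decay** phase described in the hunk above amounts to annealing the learning rate while sampling domain data. Since the card's actual optimizer and schedule are still marked TODO, the following is only a minimal, hypothetical sketch of a linear anneal in PyTorch; every value in it is illustrative, not from the model card:

```python
import torch

# Stand-in module; the real model would be the RexBERT encoder.
model = torch.nn.Linear(768, 768)
opt = torch.optim.AdamW(model.parameters(), lr=5e-4)

# Linearly anneal the LR toward zero over the decay-phase steps
# (step count and factors are hypothetical).
sched = torch.optim.lr_scheduler.LinearLR(
    opt, start_factor=1.0, end_factor=0.0, total_iters=10_000
)

for step in range(10_000):
    # ... forward/backward on e-commerce batches would go here ...
    opt.step()
    sched.step()
```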
 
@@ -158,17 +175,17 @@ By focusing on these domains, we narrow the search space to parts of the web dat
  ---
  ## Evaluation
 
- <!-- ### Token Classification
+ ### Token Classification
 
  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6893dd21467f7d2f5f358a95/DuUWO7SyzxJsN53dOSV60.png)
 
  > With 2–3x fewer parameters, RexBERT surpasses the performance of the ModernBERT series.
- -->
+
  ### Semantic Similarity
 
  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6893dd21467f7d2f5f358a95/CPrf6J1ioUGzr6vohJ4xU.png)
 
- > RexGemma models outperform all the models in their parameter/size category, including the RexBERT family of models.
+ > RexBERT models outperform all the models in their parameter/size category.
 
  ---
 
@@ -178,8 +195,8 @@ By focusing on these domains, we narrow the search space to parts of the web dat
  ```python
  from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline
 
- m = AutoModelForMaskedLM.from_pretrained("thebajajra/RexGemma-2048")
- t = AutoTokenizer.from_pretrained("thebajajra/RexGemma-2048")
+ m = AutoModelForMaskedLM.from_pretrained("thebajajra/RexBERT-base")
+ t = AutoTokenizer.from_pretrained("thebajajra/RexBERT-base")
  fill = pipeline("fill-mask", model=m, tokenizer=t)
 
  fill("Best [MASK] headphones under $100.")
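For context on the snippet in this hunk: the `fill-mask` pipeline returns a list of scored candidates for the `[MASK]` slot. A self-contained usage sketch, using the new README's model ID and the pipeline's standard output fields:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

m = AutoModelForMaskedLM.from_pretrained("thebajajra/RexBERT-base")
t = AutoTokenizer.from_pretrained("thebajajra/RexBERT-base")
fill = pipeline("fill-mask", model=m, tokenizer=t)

# Each candidate is a dict with "token_str", "score", and the filled "sequence".
for pred in fill("Best [MASK] headphones under $100.", top_k=3):
    print(pred["token_str"], round(pred["score"], 4))
```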
@@ -190,8 +207,8 @@ fill("Best [MASK] headphones under $100.")
  import torch
  from transformers import AutoTokenizer, AutoModel
 
- tok = AutoTokenizer.from_pretrained("thebajajra/RexGemma-2048")
- enc = AutoModel.from_pretrained("thebajajra/RexGemma-2048")
+ tok = AutoTokenizer.from_pretrained("thebajajra/RexBERT-base")
+ enc = AutoModel.from_pretrained("thebajajra/RexBERT-base")
 
  texts = ["nike air zoom pegasus 40", "running shoes pegasus zoom nike"]
  batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
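This hunk cuts off before pooling; the README's surrounding lines (visible in the next hunk header) mean-pool and L2-normalize, after which cosine similarity is a plain dot product. A self-contained sketch following that same recipe:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("thebajajra/RexBERT-base")
enc = AutoModel.from_pretrained("thebajajra/RexBERT-base")

texts = ["nike air zoom pegasus 40", "running shoes pegasus zoom nike"]
batch = tok(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    out = enc(**batch)

# Mean-pool over non-padding tokens, as in the quickstart, then L2-normalize.
mask = batch["attention_mask"].unsqueeze(-1)
emb = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
emb = torch.nn.functional.normalize(emb, p=2, dim=1)

# For unit vectors, cosine similarity reduces to a dot product.
print((emb[0] @ emb[1]).item())
```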
@@ -209,8 +226,8 @@ emb = torch.nn.functional.normalize(emb, p=2, dim=1)
  ```python
  from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
 
- tok = AutoTokenizer.from_pretrained("thebajajra/RexGemma-2048")
- model = AutoModelForSequenceClassification.from_pretrained("thebajajra/RexGemma-2048", num_labels=NUM_LABELS)
+ tok = AutoTokenizer.from_pretrained("thebajajra/RexBERT-base")
+ model = AutoModelForSequenceClassification.from_pretrained("thebajajra/RexBERT-base", num_labels=NUM_LABELS)
 
  # Prepare your Dataset objects: train_ds, val_ds (text→label)
  args = TrainingArguments(
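The hunk truncates inside the `TrainingArguments(` call. For orientation only, a plausible completion is sketched below; the argument values are illustrative (not from the README), and `train_ds`/`val_ds` are the Dataset objects the README's comment asks you to prepare:

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

NUM_LABELS = 2  # hypothetical label count

tok = AutoTokenizer.from_pretrained("thebajajra/RexBERT-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "thebajajra/RexBERT-base", num_labels=NUM_LABELS
)

# Illustrative hyperparameters; the README's real values lie past the hunk.
args = TrainingArguments(
    output_dir="rexbert-base-cls",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    num_train_epochs=3,
)

# train_ds / val_ds: tokenized text→label Datasets prepared beforehand.
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds)
trainer.train()
```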
@@ -232,7 +249,7 @@ trainer.train()
 
  ## Model Architecture & Compatibility
 
- - **Architecture:** Encoder-only, Gemma3-270M backbone model.
+ - **Architecture:** Encoder-only, ModernBERT-style **base** model.
  - **Libraries:** Works with **🤗 Transformers**; supports **fill-mask** and **feature-extraction** pipelines.
  - **Context length:** Increased during the **Context Extension** phase—ensure `max_position_embeddings` in `config.json` matches your desired max length.
  - **Files:** `config.json`, tokenizer files, and (optionally) heads for MLM or classification.
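As a quick check on the **Context length** bullet above, `max_position_embeddings` can be read straight off the config via the standard 🤗 Transformers API:

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("thebajajra/RexBERT-base")
# Should reflect the window set during Context Extension; keep inputs within it.
print(cfg.max_position_embeddings)
```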
@@ -251,11 +268,11 @@ trainer.train()
 
  ## License
 
- - **License:** `MIT`.
+ - **License:** `apache-2.0`.
  ---
 
  ## Maintainers & Contact
 
- - **Authors:** [Rahul Bajaj](https://huggingface.co/thebajajra)
+ - **Authors:** [Rahul Bajaj](https://huggingface.co/thebajajra), [Anuj Garg](https://huggingface.co/anujga)
 
  ---
 