Update README.md
README.md (CHANGED)
  - thebajajra/Ecom-niverse
---

# RexBERT-base

[License: Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
[RexBERT Collection](https://huggingface.co/collections/thebajajra/rexbert-68cc4b1b8a272f6beae5ebb8)
[Ecom-niverse Dataset](https://huggingface.co/datasets/thebajajra/Ecom-niverse)
[GitHub: bajajra/RexBERT](https://github.com/bajajra/RexBERT)

> **TL;DR**: An encoder-only transformer (ModernBERT-style) for **e-commerce** applications, trained in three phases—**Pre-training**, **Context Extension**, and **Decay**—to power product search, attribute extraction, classification, and embedding use cases. The model was trained on 2.3T+ tokens overall, including 350B+ e-commerce-specific tokens.

---
```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM, pipeline

MODEL_ID = "thebajajra/RexBERT-base"

# Tokenizer
tok = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)

# Encoder backbone (embeddings) and MLM head (fill-mask)
model = AutoModel.from_pretrained(MODEL_ID)
mlm = AutoModelForMaskedLM.from_pretrained(MODEL_ID)
```
## Model Description

RexBERT-base is an **encoder-only**, 150M-parameter transformer trained with a masked-language-modeling objective and optimized for **e-commerce text**. The three-phase training curriculum builds general language understanding, extends context handling, and then **specializes** on a very large corpus of commerce data to capture domain-specific terminology and entity distributions.

---
## Training Recipe

RexBERT-base was trained in **three phases**:

1) **Pre-training**
   General-purpose MLM pre-training on diverse English text for robust linguistic representations.

2) **Context Extension**
   Continued training with an **increased max sequence length** to better handle long product pages, concatenated attribute blocks, multi-turn queries, and facet strings. This preserves prior capabilities while expanding context handling.

3) **Decay on 350B+ e-commerce tokens**
   Final specialization stage on **350B+ domain-specific tokens** (product catalogs, queries, reviews, taxonomy/attributes). The learning rate and sampling weights are annealed (decayed) to consolidate domain knowledge and stabilize performance on commerce tasks; a schematic of such a schedule is sketched below, after the training-details list.

**Training details (fill in):**
- Optimizer / LR schedule: TODO
- Effective batch size / steps per phase: TODO
- Context lengths per phase (e.g., 512 → 1k/2k): TODO
- Tokenizer/vocab: TODO
- Hardware & wall-clock: TODO
- Checkpoint tags: TODO (e.g., `pretrain`, `ext`, `decay`)
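The actual decay-phase hyperparameters are still marked TODO above. Purely as an illustrative sketch (not the published recipe), a final decay phase is often implemented by annealing the learning rate linearly toward zero over the specialization tokens, e.g. with 🤗 Transformers' `get_scheduler`; every value below is an assumption:

```python
import torch
from transformers import AutoModelForMaskedLM, get_scheduler

# Illustrative only: the real optimizer, peak LR, and step budget are TODO above.
model = AutoModelForMaskedLM.from_pretrained("thebajajra/RexBERT-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # assumed peak LR

decay_steps = 10_000  # assumed step budget for the decay phase
scheduler = get_scheduler(
    "linear",              # anneal the LR linearly toward zero
    optimizer=optimizer,
    num_warmup_steps=0,    # no re-warmup when entering the decay phase
    num_training_steps=decay_steps,
)

for _ in range(decay_steps):
    # ... forward/backward on e-commerce batches goes here ...
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```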
---

## Evaluation

### Token Classification

*(figure: token-classification benchmark results)*

> With 2–3x fewer parameters, RexBERT surpasses the performance of the ModernBERT series.

### Semantic Similarity

*(figure: semantic-similarity benchmark results)*

> RexBERT models outperform all other models in their parameter/size category.

---
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

m = AutoModelForMaskedLM.from_pretrained("thebajajra/RexBERT-base")
t = AutoTokenizer.from_pretrained("thebajajra/RexBERT-base")
fill = pipeline("fill-mask", model=m, tokenizer=t)

fill("Best [MASK] headphones under $100.")
```
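`fill(...)` returns a list of candidate completions with scores; a quick way to inspect the top predictions, reusing the `fill` pipeline from above:

```python
for pred in fill("Best [MASK] headphones under $100."):
    # each prediction is a dict with the filled token and its probability
    print(f'{pred["token_str"]:>15}  {pred["score"]:.3f}')
```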
```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("thebajajra/RexBERT-base")
enc = AutoModel.from_pretrained("thebajajra/RexBERT-base")

texts = ["nike air zoom pegasus 40", "running shoes pegasus zoom nike"]
batch = tok(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    out = enc(**batch)

# Mean-pool the last hidden state over valid tokens, then L2-normalize
emb = (out.last_hidden_state * batch.attention_mask.unsqueeze(-1)).sum(dim=1) / batch.attention_mask.sum(dim=1, keepdim=True)
emb = torch.nn.functional.normalize(emb, p=2, dim=1)
```
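Because the rows of `emb` are L2-normalized, the dot product of the two rows is their cosine similarity:

```python
sim = float(emb[0] @ emb[1])  # cosine similarity between the two texts
print(sim)
```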
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

tok = AutoTokenizer.from_pretrained("thebajajra/RexBERT-base")
model = AutoModelForSequenceClassification.from_pretrained("thebajajra/RexBERT-base", num_labels=NUM_LABELS)

# Prepare your Dataset objects: train_ds, val_ds (text→label)
args = TrainingArguments(
    output_dir="rexbert-base-ft",   # example values; tune for your task
    per_device_train_batch_size=32,
    learning_rate=5e-5,
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds)
trainer.train()
```
## Model Architecture & Compatibility

- **Architecture:** Encoder-only, ModernBERT-style **base** model.
- **Libraries:** Works with **🤗 Transformers**; supports **fill-mask** and **feature-extraction** pipelines.
- **Context length:** Increased during the **Context Extension** phase—ensure `max_position_embeddings` in `config.json` matches your desired max length.
- **Files:** `config.json`, tokenizer files, and (optionally) heads for MLM or classification.
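To confirm the context window a downloaded checkpoint is configured for, a minimal check (attribute name as in the bullet above):

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("thebajajra/RexBERT-base")
print(cfg.max_position_embeddings)  # max sequence length the checkpoint supports
```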
## License

- **License:** `apache-2.0`

---

## Maintainers & Contact

- **Authors:** [Rahul Bajaj](https://huggingface.co/thebajajra), [Anuj Garg](https://huggingface.co/anujga)

---
|