Update README.md
README.md (CHANGED)
  - thebajajra/Ecom-niverse
---

# RexBERT-base

[License: Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
[RexBERT Collection](https://huggingface.co/collections/thebajajra/rexbert-68cc4b1b8a272f6beae5ebb8)
[Ecom-niverse Dataset](https://huggingface.co/datasets/thebajajra/Ecom-niverse)
[GitHub: bajajra/RexBERT](https://github.com/bajajra/RexBERT)

> **TL;DR**: An encoder-only transformer (ModernBERT-style) for **e-commerce** applications, trained in three phases—**Pre-training**, **Context Extension**, and **Decay**—to power product search, attribute extraction, classification, and embedding use cases. The model was trained on 2.3T+ tokens overall, including 350B+ e-commerce-specific tokens.

---
```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM, pipeline

MODEL_ID = "thebajajra/RexBERT-base"

# Tokenizer
tok = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)

# Encoder backbone (embeddings) and MLM head (fill-mask)
model = AutoModel.from_pretrained(MODEL_ID)
mlm = AutoModelForMaskedLM.from_pretrained(MODEL_ID)
```
## Model Description

RexBERT-base is an **encoder-only**, 150M-parameter transformer trained with a masked-language-modeling objective and optimized for **e-commerce text**. The three-phase training curriculum builds general language understanding, extends context handling, and then **specializes** on a very large corpus of commerce data to capture domain-specific terminology and entity distributions.

---
## Training Recipe

RexBERT-base was trained in **three phases**:

1) **Pre-training**
   General-purpose MLM pre-training on diverse English text for robust linguistic representations.

2) **Context Extension**
   Continued training with an **increased max sequence length** to better handle long product pages, concatenated attribute blocks, multi-turn queries, and facet strings. This preserves prior capabilities while expanding context handling.

3) **Decay on 350B+ e-commerce tokens**
   Final specialization stage on **350B+ domain-specific tokens** (product catalogs, queries, reviews, taxonomy/attributes). The learning rate and sampling weights are annealed (decayed) to consolidate domain knowledge and stabilize performance on commerce tasks; a schematic of such a schedule is sketched below, after the training-details list.

**Training details (fill in):**
- Optimizer / LR schedule: TODO
- Effective batch size / steps per phase: TODO
- Context lengths per phase (e.g., 512 → 1k/2k): TODO
- Tokenizer/vocab: TODO
- Hardware & wall-clock: TODO
- Checkpoint tags: TODO (e.g., `pretrain`, `ext`, `decay`)
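The actual decay-phase hyperparameters are still marked TODO above. Purely as an illustrative sketch (not the published recipe), a final decay phase is often implemented by annealing the learning rate linearly toward zero over the specialization tokens, e.g. with 🤗 Transformers' `get_scheduler`; every value below is an assumption:

```python
import torch
from transformers import AutoModelForMaskedLM, get_scheduler

# Illustrative only: the real optimizer, peak LR, and step budget are TODO above.
model = AutoModelForMaskedLM.from_pretrained("thebajajra/RexBERT-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # assumed peak LR

decay_steps = 10_000  # assumed step budget for the decay phase
scheduler = get_scheduler(
    "linear",              # anneal the LR linearly toward zero
    optimizer=optimizer,
    num_warmup_steps=0,    # no re-warmup when entering the decay phase
    num_training_steps=decay_steps,
)

for _ in range(decay_steps):
    # ... forward/backward on e-commerce batches goes here ...
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```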
---

## Evaluation

### Token Classification

*(figure: token-classification benchmark results)*

> With 2–3x fewer parameters, RexBERT surpasses the performance of the ModernBERT series.

### Semantic Similarity

*(figure: semantic-similarity benchmark results)*

> RexBERT models outperform all other models in their parameter/size category.

---
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

m = AutoModelForMaskedLM.from_pretrained("thebajajra/RexBERT-base")
t = AutoTokenizer.from_pretrained("thebajajra/RexBERT-base")
fill = pipeline("fill-mask", model=m, tokenizer=t)

fill("Best [MASK] headphones under $100.")
```
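`fill(...)` returns a list of candidate completions with scores; a quick way to inspect the top predictions, reusing the `fill` pipeline from above:

```python
for pred in fill("Best [MASK] headphones under $100."):
    # each prediction is a dict with the filled token and its probability
    print(f'{pred["token_str"]:>15}  {pred["score"]:.3f}')
```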
```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("thebajajra/RexBERT-base")
enc = AutoModel.from_pretrained("thebajajra/RexBERT-base")

texts = ["nike air zoom pegasus 40", "running shoes pegasus zoom nike"]
batch = tok(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    out = enc(**batch)

# Mean-pool the last hidden state over valid tokens, then L2-normalize
emb = (out.last_hidden_state * batch.attention_mask.unsqueeze(-1)).sum(dim=1) / batch.attention_mask.sum(dim=1, keepdim=True)
emb = torch.nn.functional.normalize(emb, p=2, dim=1)
```
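Because the rows of `emb` are L2-normalized, the dot product of the two rows is their cosine similarity:

```python
sim = float(emb[0] @ emb[1])  # cosine similarity between the two texts
print(sim)
```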
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

tok = AutoTokenizer.from_pretrained("thebajajra/RexBERT-base")
model = AutoModelForSequenceClassification.from_pretrained("thebajajra/RexBERT-base", num_labels=NUM_LABELS)

# Prepare your Dataset objects: train_ds, val_ds (text→label)
args = TrainingArguments(
    output_dir="rexbert-base-ft",   # example values; tune for your task
    per_device_train_batch_size=32,
    learning_rate=5e-5,
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds)
trainer.train()
```
## Model Architecture & Compatibility

- **Architecture:** Encoder-only, ModernBERT-style **base** model.
- **Libraries:** Works with **🤗 Transformers**; supports **fill-mask** and **feature-extraction** pipelines.
- **Context length:** Increased during the **Context Extension** phase—ensure `max_position_embeddings` in `config.json` matches your desired max length.
- **Files:** `config.json`, tokenizer files, and (optionally) heads for MLM or classification.
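To confirm the context window a downloaded checkpoint is configured for, a minimal check (attribute name as in the bullet above):

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("thebajajra/RexBERT-base")
print(cfg.max_position_embeddings)  # max sequence length the checkpoint supports
```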
## License

- **License:** `apache-2.0`

---

## Maintainers & Contact

- **Authors:** [Rahul Bajaj](https://huggingface.co/thebajajra), [Anuj Garg](https://huggingface.co/anujga)

---
|