thebajajra committed on
Commit
1e91ca9
·
verified ·
1 Parent(s): d14e4c3

Update README.md

Files changed (1)
  1. README.md +26 -1
README.md CHANGED
@@ -141,8 +141,33 @@ RexBERT-micro was trained in **three phases**:
 
  ## Data Overview
 
+ - **Dataset:** [Ecom-niverse](https://huggingface.co/datasets/thebajajra/Ecom-niverse)
  - **Domain mix:**
- - **Data quality:**
+
+ We identified 9 domains that overlap with e-commerce; each contains a significant amount of relevant tokens but required filtering. The table below lists these domains and their filtered sizes.
+ | Domain | Size (GB) |
+ |---|---|
+ | Hobby | 114 |
+ | News | 66 |
+ | Health | 66 |
+ | Entertainment | 64 |
+ | Travel | 52 |
+ | Food | 22 |
+ | Automotive | 19 |
+ | Sports | 12 |
+ | Music and Dance | 7 |
+
+ Additionally, 6 more domains overlapped almost completely with e-commerce and were taken directly from FineFineWeb.
+ | Domain | Size (GB) |
+ |---|---|
+ | Fashion | 37 |
+ | Beauty | 37 |
+ | Celebrity | 28 |
+ | Movie | 26 |
+ | Photo | 15 |
+ | Painting | 2 |
+
+ By focusing on these domains, we narrow the search space to the parts of the web data where shopping-related text is likely to appear. However, even within a chosen domain, not every item is actually about buying or selling; many are informational articles, news, or unrelated discussions. Thus, finer-grained filtering within each domain is required to extract only the e-commerce-specific lines. We accomplish this by training a lightweight classifier per domain to distinguish e-commerce content from non-e-commerce content.
 
 
 
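The per-domain filtering step described in the added README text could look roughly like the sketch below. The README does not say which classifier type, features, or training data were used, so everything here is an assumption made for illustration: the tiny Naive Bayes model (`TinyNaiveBayes`), the `filter_lines` helper, and the hand-written training rows are all hypothetical stand-ins, and only the domain names come from the tables in the diff.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9$]+")


def tokenize(text):
    """Lowercase the text and split it into simple alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


class TinyNaiveBayes:
    """Binary multinomial Naive Bayes with add-one smoothing (illustrative stand-in)."""

    def fit(self, texts, labels):
        self.counts = {0: Counter(), 1: Counter()}
        self.docs = Counter(labels)
        for text, label in zip(texts, labels):
            self.counts[label].update(tokenize(text))
        self.vocab = set(self.counts[0]) | set(self.counts[1])
        return self

    def log_score(self, text, label):
        total = sum(self.counts[label].values())
        score = math.log(self.docs[label] / sum(self.docs.values()))
        for tok in tokenize(text):
            if tok in self.vocab:  # ignore tokens never seen in training
                score += math.log(
                    (self.counts[label][tok] + 1) / (total + len(self.vocab))
                )
        return score

    def is_ecommerce(self, text):
        return self.log_score(text, 1) > self.log_score(text, 0)


# Hypothetical labelled lines per domain: 1 = e-commerce, 0 = other content.
TRAIN = {
    "food": [
        ("buy organic coffee beans online with free shipping", 1),
        ("add this blender to your cart for a discount price", 1),
        ("the history of bread baking in medieval europe", 0),
        ("researchers study how caffeine affects sleep", 0),
    ],
    "travel": [
        ("book cheap luggage sets on sale today", 1),
        ("order a travel pillow and get free shipping", 1),
        ("the city council debated a new airport runway", 0),
        ("a short history of rail travel in the alps", 0),
    ],
}

# One lightweight classifier per domain, as the README text describes.
classifiers = {
    domain: TinyNaiveBayes().fit([t for t, _ in rows], [y for _, y in rows])
    for domain, rows in TRAIN.items()
}


def filter_lines(domain, lines):
    """Keep only the lines that the domain's classifier labels as e-commerce."""
    clf = classifiers[domain]
    return [line for line in lines if clf.is_ecommerce(line)]


kept = filter_lines("food", [
    "buy coffee beans with free shipping",
    "the history of bread in europe",
])
print(kept)
```

Training a separate model per domain lets each classifier learn that domain's own vocabulary for shopping and non-shopping text, at the cost of needing labelled examples for every domain.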