rtferraz/domainTokenizer
arxiv:
9 papers
main
1 contributor
History:
48 commits
rtferraz
Fix label leakage: temporal split - use first 70% of events as input, predict purchase in last 30%. Remove n_purchases/purchase_rate from features.
e4d8561
verified
1 day ago
docs
Add e-commerce pre-training report - successful demo, behavioral clusters found, future improvements noted
1 day ago
examples
Phase 3.0: Pipeline validation demo on mindweave/bank-transactions-us - ALL 10 CHECKS PASSED
7 days ago
notebooks
Fix label leakage: temporal split - use first 70% of events as input, predict purchase in last 30%. Remove n_purchases/purchase_rate from features.
1 day ago
src
CRITICAL FIX: Switch from ByteLevel to Whitespace pre-tokenizer - fixes 42% UNK rate on domain token sequences
1 day ago
tests
Add fine-tuning test suite - 15 tests covering dataset, batching, forward/backward, Trainer smoke, multiclass
7 days ago
.gitattributes
1.52 kB
initial commit
7 days ago
.gitignore
452 Bytes
Add .gitignore - Python, Jupyter, training artifacts, IDE files
6 days ago
README.md
8.46 kB
Update README v0.3.0 - add usage example, update roadmap status, add implementation report link
7 days ago