Commit History

Fix label leakage: temporal split - use first 70% of events as input, predict purchase in last 30%. Remove n_purchases/purchase_rate from features.
e4d8561
verified

rtferraz committed on

Fix model loading: use from_pretrained() instead of torch.load() for safetensors format
165b138

Add 03_ecommerce_finetune.ipynb - next-purchase prediction with JointFusion, LightGBM baseline comparison
857ec9a

Add e-commerce pre-training report - successful demo, behavioral clusters found, future improvements noted
2b3e3af

Update 02_ecommerce notebook: add HF login, memory-free cell, subsample option for <64GB RAM machines
2410b7e

CRITICAL FIX: Switch from ByteLevel to Whitespace pre-tokenizer - fixes 42% UNK rate on domain token sequences
a9c4a62

Add 02_ecommerce_pretrain.ipynb - REES46 e-commerce pre-training with sequential entropy check, wandb, push to hub
d60868a

Add finance pre-training report - honest analysis of results and lessons learned
709a7e2

Add .gitignore - Python, Jupyter, training artifacts, IDE files
9211898

Fix notebook: total_mem → total_memory, add hub_model_id push, add wandb logging support
65ecf7e

Add 01_finance_pretrain.ipynb - Phase 3.1 notebook for pre-training on 5M Nigerian financial transactions
2c3ddfa

Phase 3.0: Pipeline validation demo on mindweave/bank-transactions-us - ALL 10 CHECKS PASSED
6e5b80d

Add ADR-002: Dataset selection for Phase 3 demos - research findings, rationale, phased plan
756d197

Update implementation report: add Phase 2D, update header to v0.4.0 / 139 tests, update cumulative summary and API
7aac458

Add fine-tuning test suite - 15 tests covering dataset, batching, forward/backward, Trainer smoke, multiclass
abab711

Update package to v0.4.0 with fine-tuning exports
7edb04f

Update training init with finetune exports
64d55e2

Add finetune.py - finetune_domain_model (HF Trainer Pattern A, auto tabular_features passthrough)
46a6d37

Phase 2D: Fine-tuning pipeline - DomainFinetuneDataset, finetune_domain_model, 139 total tests passing
256963c

Update README v0.3.0 - add usage example, update roadmap status, add implementation report link
f580186

Add Phase 2A-2C implementation report - technical decisions, architecture summary, test results
6c4ad4d

Add training test suite - 19 tests covering data pipeline, packing, collation, integration, Trainer smoke test
345d9e3

Update package to v0.3.0 with training exports
af3b720

Add pretrain.py - pretrain_domain_model with HF Trainer, cosine schedule, DataCollatorForLanguageModeling
6ccb9e6

Add data_pipeline.py - tokenize_user_sequences, pack_sequences, prepare_clm_dataset
1dfd4e2

Phase 2C: Pre-training pipeline - data pipeline, sequence packing, HF Trainer CLM, 124 total tests passing
28118c7

Add model test suite - 33 tests covering config, model, PLR, DCNv2, joint fusion, integration
ab8a8b6

Update package init to v0.2.0 with model exports
b86b1ee

Add DCNv2 + JointFusionModel (nuFormer-style Transformer + tabular fusion)
e881ea3

Add PLR embeddings (Gorishniy et al. 2022)
d685c0e

Add DomainTransformerForCausalLM - GPT-style NoPE model with SDPA attention, weight tying, HF Trainer compatible
0dec8e4

Add DomainTransformerConfig with presets (24M/85M/330M)
15fbfea

Phase 2B: Model architecture - DomainTransformerForCausalLM (NoPE, GPT-style), PLR embeddings, DCNv2 + JointFusion, 105 passing tests
2f5969e

Add comprehensive test suite - 72 passing tests covering all components
8efa945

Add predefined schemas (FINANCE, ECOMMERCE, HEALTHCARE)
c00ac2c

Add domain_tokenizer.py - DomainTokenizerBuilder (core assembler, HF integration)
818a2e9

Add field_tokenizers.py - Sign, MagnitudeBucket, Calendar, Categorical, DiscreteNumerical tokenizers
511f3aa

Add tokenizers package init
0b06df3

Add schemas package init
04b1d24

Add schema.py - DomainSchema, FieldSpec, FieldType definitions
1a9dad0

Phase 2A: Core tokenizer library - schema, field tokenizers, composite builder, predefined schemas, 72 passing tests
0c1ca58

Update README: add ADR reference, update documentation table and repo structure
a239d6e

Add ADR-001: Implementation framework decision with detailed roadmap
25a1093

Update README with Nubank case study and expanded repo structure
e30a14d

Add Nubank nuFormer reverse-engineering analysis - full pipeline reconstruction
51149fa

Add README with project overview and vision
f930fef

Add comprehensive research report on domain-specific tokenization
be86e60

initial commit
356a72e