CRITICAL FIX: Switch from ByteLevel to Whitespace pre-tokenizer - fixes 42% UNK rate on domain token sequences a9c4a62 verified rtferraz committed 1 day ago
Add finetune.py - finetune_domain_model (HF Trainer Pattern A, auto tabular_features passthrough) 46a6d37 verified rtferraz committed 7 days ago
Phase 2D: Fine-tuning pipeline - DomainFinetuneDataset, finetune_domain_model, 139 total tests passing 256963c verified rtferraz committed 7 days ago
Add pretrain.py - pretrain_domain_model with HF Trainer, cosine schedule, DataCollatorForLanguageModeling 6ccb9e6 verified rtferraz committed 8 days ago
Add data_pipeline.py - tokenize_user_sequences, pack_sequences, prepare_clm_dataset 1dfd4e2 verified rtferraz committed 8 days ago
Phase 2C: Pre-training pipeline - data pipeline, sequence packing, HF Trainer CLM, 124 total tests passing 28118c7 verified rtferraz committed 8 days ago
Add DCNv2 + JointFusionModel (nuFormer-style Transformer + tabular fusion) e881ea3 verified rtferraz committed 8 days ago
Add DomainTransformerForCausalLM - GPT-style NoPE model with SDPA attention, weight tying, HF Trainer compatible 0dec8e4 verified rtferraz committed 8 days ago
Add DomainTransformerConfig with presets (24M/85M/330M) 15fbfea verified rtferraz committed 8 days ago
Phase 2B: Model architecture - DomainTransformerForCausalLM (NoPE, GPT-style), PLR embeddings, DCNv2 + JointFusion, 105 passing tests 2f5969e verified rtferraz committed 8 days ago
Add predefined schemas (FINANCE, ECOMMERCE, HEALTHCARE) c00ac2c verified rtferraz committed 8 days ago
Add domain_tokenizer.py - DomainTokenizerBuilder (core assembler, HF integration) 818a2e9 verified rtferraz committed 8 days ago
Add field_tokenizers.py - Sign, MagnitudeBucket, Calendar, Categorical, DiscreteNumerical tokenizers 511f3aa verified rtferraz committed 8 days ago
Add schema.py - DomainSchema, FieldSpec, FieldType definitions 1a9dad0 verified rtferraz committed 8 days ago
Phase 2A: Core tokenizer library - schema, field tokenizers, composite builder, predefined schemas, 72 passing tests 0c1ca58 verified rtferraz committed 8 days ago