# Phase 2A–2D Implementation Report

> **domainTokenizer v0.4.0** — Core library complete: tokenizers, models, pre-training, fine-tuning
>
> **139 tests passing** (72 tokenizer + 33 model + 19 pre-training + 15 fine-tuning)
>
> *April 2026*

---

## Overview

Phase 2 implements the complete domainTokenizer library — everything needed to go from raw domain events (financial transactions, e-commerce actions, clinical encounters) to a fine-tuned downstream prediction model. The implementation directly follows the validated patterns from Nubank's nuFormer (arXiv:2507.23267), with architecture decisions grounded in 6 audited reference papers.

The library is organized as four layers, each built and tested independently before composing into the next:

```
Phase 2A: Tokenizers → Phase 2B: Models → Phase 2C: Pre-training → Phase 2D: Fine-tuning
(schema → tokens)      (tokens → loss)    (CLM on sequences)       (joint fusion on labels)
```

---

## Phase 2A: Domain Tokenizer Library (Weeks 1–3)

### What Was Built

A declarative schema system and five per-field tokenizers that convert raw domain events into HuggingFace-compatible token sequences.

| Component | Purpose | Output |
|-----------|---------|--------|
| `DomainSchema` + `FieldSpec` | Declarative event definition — fields, types, bin counts | Schema object |
| `SignTokenizer` | Credit/debit, +/- | `79.99 → [AMT_SIGN_POS]` |
| `MagnitudeBucketTokenizer` | Quantile-based numerical bins (fits on data) | `79.99 → [AMT_15]` |
| `CalendarTokenizer` | Timestamp → month/dow/dom/hour decomposition | `Mar 15 2pm → 4 tokens` |
| `CategoricalTokenizer` | Fixed category mapping with UNK fallback | `"purchase" → [EVT_001]` |
| `DiscreteNumericalTokenizer` | Small integers with overflow | `3 → [QTY_03]`, `15 → [QTY_OVER]` |
| `DomainTokenizerBuilder` | Assembles per-field tokenizers → HF `PreTrainedTokenizerFast` | HF tokenizer |

Three predefined schemas ship out of the box:

- `FINANCE_SCHEMA` — 97 domain tokens (Nubank-compatible: sign + amount bins + calendar)
- `ECOMMERCE_SCHEMA` — event type + price + quantity + category + calendar + product title
- `HEALTHCARE_SCHEMA` — clinical event type + cost + severity + provider + calendar + description

### Key Technical Decisions

1. **Hybrid vocabulary: special tokens + BPE.** Following Nubank exactly, structured fields (amounts, dates, categories) become single special tokens while free-text fields (descriptions, product titles) use standard BPE. This compresses each event to ~14 tokens vs ~35–50 with pure text serialization, roughly tripling the number of events that fit in a 2048-token context window.

2. **Quantile-based magnitude binning (not linear).** The `MagnitudeBucketTokenizer` computes its bin edges from quantiles of absolute values, not uniform-width bins. Financial data is heavily skewed (many small transactions, few large ones), so quantile bins ensure each bin gets roughly equal representation in the training data, maximizing the model's ability to distinguish between common transaction sizes (see the sketch after this list).

3. **Separate sign and magnitude tokenization.** Following Nubank's `ϕ_sign` + `ϕ_amt` pattern, the sign (credit/debit) is tokenized independently from the magnitude. This lets the model learn that "a $500 inflow" and "a $500 outflow" share magnitude semantics but differ in direction — without wasting bins on both positive and negative ranges.

4. **Schema-driven factory pattern.** Field tokenizers are created automatically from `FieldSpec` declarations via `create_field_tokenizer()`. Adding a new domain requires only defining a `DomainSchema` — no code changes to the tokenizer pipeline. This enables rapid domain iteration (finance → e-commerce → healthcare) without engineering overhead.

5. **Data-dependent tokenizers require explicit fitting.** `MagnitudeBucketTokenizer` must be `.fit()` on training data before use. Calling `.build()` on an unfitted schema raises `RuntimeError`. This prevents a subtle bug where bin edges are computed on test data, leaking information.

6. **HuggingFace-native output.** The `DomainTokenizerBuilder.build()` method produces a standard `PreTrainedTokenizerFast` — the same type returned by `AutoTokenizer.from_pretrained()`. This means zero adaptation for HF Trainer, `push_to_hub()`, `save_pretrained()`, ONNX export, etc.
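To make decisions 2 and 3 concrete, here is a minimal, self-contained sketch of quantile binning with a separate sign token. It is illustrative only: it uses plain NumPy rather than the library's `MagnitudeBucketTokenizer` and `SignTokenizer`, and the token strings simply mirror the naming in the table above.

```python
import numpy as np

def fit_bin_edges(train_amounts, n_bins=32):
    """Quantile edges over absolute values, so skewed data fills all bins roughly evenly."""
    abs_vals = np.abs(np.asarray(train_amounts, dtype=float))
    interior_quantiles = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return np.quantile(abs_vals, interior_quantiles)

def tokenize_amount(value, edges):
    """Sign token + magnitude-bucket token, mirroring the [AMT_SIGN_*]/[AMT_NN] naming above."""
    sign_token = "[AMT_SIGN_POS]" if value >= 0 else "[AMT_SIGN_NEG]"
    bucket = int(np.digitize(abs(value), edges))  # 0 .. n_bins-1
    return [sign_token, f"[AMT_{bucket:02d}]"]

edges = fit_bin_edges([3.20, 12.50, 79.99, 450.00, 8200.00] * 200, n_bins=32)
print(tokenize_amount(79.99, edges))    # sign token + a mid-range magnitude bucket
print(tokenize_amount(-500.00, edges))  # different sign token, magnitude bucket near that of +500
```

In the real tokenizer the edges come from `.fit()` on training data (decision 5), which is what keeps test-set statistics out of the bin boundaries.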
### Test Results

**72 tests passing**, covering: field spec validation, all 5 tokenizer types (including edge cases: NaN, None, overflow, unknown categories), predefined schemas (including the Nubank 97-token compatibility check), the builder fit/build/tokenize/encode pipeline, and full end-to-end sequence encoding.

---

## Phase 2B: Model Architecture (Weeks 3–5)

### What Was Built

A GPT-style causal decoder Transformer registered as a HuggingFace `PreTrainedModel`, plus numerical embeddings and joint fusion components.

| Component | Purpose | Based On |
|-----------|---------|----------|
| `DomainTransformerConfig` | HF-compatible config with presets (`"24m"`, `"85m"`, `"330m"`) | Nubank nuFormer sizes |
| `DomainTransformerForCausalLM` | Causal decoder: NoPE, pre-norm, SDPA attention, weight tying | NoPE (arXiv:2305.19466) + GPT-2 |
| `PeriodicLinearReLU` | Learned sin/cos embeddings for numerical features | Gorishniy et al. (arXiv:2203.05556) |
| `DCNv2` + `JointFusionModel` | Transformer + tabular feature fusion for fine-tuning | Nubank + DCN V2 (arXiv:2008.13535) |

### Key Technical Decisions

1. **NoPE (No Positional Encoding).** Following Kazemnejad et al. (NeurIPS 2023), the model uses no positional encoding at all — no absolute embeddings, no RoPE, no ALiBi. In that work, NoPE matched or outperformed the tested PE schemes on length-generalization benchmarks. For domain sequences where users have vastly different history lengths (20 to 2000+ events), length generalization is critical. The model implicitly learns relative position from the causal attention mask pattern.

2. **`F.scaled_dot_product_attention` with `is_causal=True`, not `nn.MultiheadAttention`.** PyTorch's `nn.MultiheadAttention` has a known bug requiring an explicit `attn_mask` even when `is_causal=True` is set. We implement attention directly with `F.scaled_dot_product_attention`, which auto-dispatches to FlashAttention/cuDNN kernels when available on CUDA and falls back to an efficient C++ kernel on CPU.

3. **HF attention mask conversion.** HuggingFace Trainer sends attention masks as `(B, T)` long tensors (1 = attend, 0 = pad). PyTorch SDPA expects either `None` (use `is_causal`) or a float mask where masked positions are `-inf`. The attention module handles this conversion: when a mask is provided, it is expanded to `(B, 1, 1, T)`, converted to float, and inverted (`0 → -inf`, `1 → 0.0`). When no mask is provided, `is_causal=True` handles causality for free (a sketch of this conversion follows the list).

4. **Weight tying via the HF v5.7+ dict format.** The `_tied_weights_keys` API changed from a list to a dict in transformers 5.7. We use `{"lm_head.weight": "model.embed_tokens.weight"}` together with proper `get/set_input_embeddings` and `get/set_output_embeddings` implementations; `post_init()` handles the actual tying.

5. **Pre-norm architecture (LayerNorm before attention/FFN).** GPT-2 and most modern LLMs use pre-norm. It makes training more stable than post-norm, especially at the 24M–330M scale, where we don't have the luxury of extensive hyperparameter tuning.

6. **`get_user_embedding()` method on the CausalLM class.** Downstream tasks (classification, joint fusion) need a single vector representing the user's transaction history. This method extracts the hidden state at the last non-padding position — the standard approach for decoder-only models. It uses `attention_mask.sum(dim=1) - 1` to find the last real token position per sequence (also shown in the sketch below).

7. **PLR frequencies and phases are learned parameters.** Unlike fixed Fourier features, PLR initializes frequencies and phases as trainable `nn.Parameter` tensors. This lets the model discover the most informative frequency decomposition for each numerical feature during training — crucial for financial data, where relevant scales span 4+ orders of magnitude.
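Decisions 2, 3, and 6 revolve around two mask idioms. The sketch below shows both: converting an HF-style `(B, T)` padding mask into an additive SDPA mask, and extracting a user embedding at the last non-padding position. It is a hypothetical standalone helper, not the library's attention module; in particular, how the padding and causal masks are combined may differ in the real code.

```python
import torch
import torch.nn.functional as F

def sdpa_causal(q, k, v, attention_mask=None):
    """q, k, v: (B, n_heads, T, head_dim); attention_mask: (B, T) long, 1 = attend, 0 = pad."""
    if attention_mask is None:
        # No padding: SDPA builds the causal mask internally.
        return F.scaled_dot_product_attention(q, k, v, is_causal=True)
    neg = torch.finfo(q.dtype).min
    # (B, T) -> (B, 1, 1, T), then invert: 1 -> 0.0, 0 -> large negative
    pad_bias = (1.0 - attention_mask[:, None, None, :].to(q.dtype)) * neg
    # SDPA disallows attn_mask together with is_causal=True, so add an explicit causal bias.
    T = q.shape[-2]
    causal_bias = torch.triu(torch.full((T, T), neg, dtype=q.dtype, device=q.device), diagonal=1)
    return F.scaled_dot_product_attention(q, k, v, attn_mask=pad_bias + causal_bias)

def last_token_embedding(hidden_states, attention_mask):
    """hidden_states: (B, T, d); returns (B, d) taken at the last non-padding position."""
    last_idx = attention_mask.sum(dim=1) - 1                     # (B,)
    return hidden_states[torch.arange(hidden_states.shape[0]), last_idx]

# Shapes only, to confirm the plumbing:
B, H, T, D = 2, 4, 16, 8
q = k = v = torch.randn(B, H, T, D)
mask = torch.ones(B, T, dtype=torch.long)
mask[1, 12:] = 0                                                 # second sequence has 4 pad positions
out = sdpa_causal(q, k, v, mask)                                 # (B, H, T, D)
emb = last_token_embedding(torch.randn(B, T, 32), mask)          # (B, 32)
```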
### Test Results

**33 tests passing**, covering: config presets and serialization, base model forward shapes, CausalLM with/without labels, loss differentiability, weight tying verification, user embedding extraction (with and without mask), parameter counts for tiny and 24M configs, gradient checkpointing, causal masking verification, PLR shapes and gradients, DCNv2 cross layers, JointFusion binary and multiclass, and full tokenizer→model→loss integration.

---

## Phase 2C: Pre-training Pipeline (Weeks 5–7)

### What Was Built

A data pipeline and training harness that connects the tokenizer and model layers into a complete CLM pre-training workflow.

| Component | Purpose |
|-----------|---------|
| `tokenize_user_sequences()` | Converts lists of user event sequences → variable-length token ID lists |
| `pack_sequences()` | Packs variable-length sequences into fixed-length blocks (run_clm.py pattern) |
| `prepare_clm_dataset()` | Convenience pipeline: user events → tokenize → pack → HF `Dataset` |
| `pretrain_domain_model()` | Pre-trains via HF Trainer with `DataCollatorForLanguageModeling` and a cosine schedule |

### Key Technical Decisions

1. **Sequence packing, not padding.** Following the official HF `run_clm.py` pattern, all tokenized user sequences are concatenated into one long stream and split into fixed-length blocks. This achieves 100% token utilization — every position in every training example is a real token contributing gradient signal. Padding wastes 30–70% of tokens for variable-length sequences, which is unacceptable when training data is finite (the typical business scenario). The trade-off: cross-sequence boundaries exist within blocks. For domain events delimited by `[BOS]`/`[EOS]`/`[SEP_EVENT]` tokens this is benign — the model learns to handle delimiters naturally. (A packing sketch follows this list.)

2. **`DataCollatorForLanguageModeling(mlm=False)` handles label creation.** The HF Trainer does NOT auto-inject `labels`. The data collator does: it clones `input_ids`, sets `labels = input_ids`, and masks any padding positions (token_id == pad_token_id) with `-100` so they don't contribute to loss. Our packed sequences have no padding, so `labels == input_ids` exactly — every token is a training target.

3. **`processing_class` parameter (not `tokenizer`).** HuggingFace Trainer v5.7 renamed `tokenizer` to `processing_class` in `Trainer.__init__()`; the old name raises `TypeError`. This is an API break that only manifests at runtime — caught and fixed during testing.

4. **Cosine learning rate schedule with warmup.** Following Nubank and standard GPT pre-training practice. The cosine schedule decays smoothly from the peak LR to near zero, avoiding the abrupt drops of step schedules. Warmup prevents early training instability while loss gradients are large and noisy.

5. **`disable_tqdm=True` and `logging_strategy="steps"`.** For cloud/headless execution, tqdm progress bars are useless (they produce thousands of `\r` characters in log files). Plain-text step-by-step logging (`loss=X.XXX, grad_norm=Y.YYY, lr=Z.ZZZ`) is greppable and parseable by monitoring tools.

6. **Dataset yields only `{"input_ids": [...]}`.** The collator adds `labels` and `attention_mask`. The Trainer's `remove_unused_columns=True` (the default) auto-drops any extra columns not in the model's `forward()` signature. This means you can safely store metadata (user IDs, sequence lengths) in the dataset — it is dropped before batching.
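A minimal sketch of the packing-plus-collator flow from decisions 1 and 2 follows. The `pack` helper and the toy token-ID lists are hypothetical stand-ins for `pack_sequences()` and the tokenizer output; the snippet assumes `hf_tokenizer` is the `PreTrainedTokenizerFast` produced by `DomainTokenizerBuilder.build()` (see the API summary at the end of this report).

```python
from itertools import chain
from datasets import Dataset
from transformers import DataCollatorForLanguageModeling

def pack(token_id_lists, block_size):
    """Concatenate all sequences into one stream, then cut fixed-size blocks (remainder dropped)."""
    stream = list(chain.from_iterable(token_id_lists))
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

# Toy stand-ins for tokenized user sequences of varying length.
token_id_lists = [[11, 17, 42, 8, 12], [11, 5, 5, 9, 23, 31, 12], [11, 64, 12]]
blocks = pack(token_id_lists, block_size=4)
dataset = Dataset.from_dict({"input_ids": blocks})   # only input_ids, per decision 6

# hf_tokenizer: the PreTrainedTokenizerFast built by DomainTokenizerBuilder.build().
# mlm=False => causal LM: the collator copies input_ids into labels and would set
# padding positions to -100; fully packed blocks contain no padding, so labels == input_ids.
collator = DataCollatorForLanguageModeling(tokenizer=hf_tokenizer, mlm=False)
batch = collator([dataset[i] for i in range(len(dataset))])
print(batch["input_ids"].shape, batch["labels"].shape, batch["attention_mask"].shape)
```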
### Smoke Test Results

A 24-step training run on CPU with a tiny model (64-dim, 2 layers) confirmed the full pipeline:

```
Step  1: loss=5.419  grad_norm=7.227  lr=1.000e-03
Step 12: loss=4.510  grad_norm=3.668  lr=5.653e-04
Step 24: loss=4.322  grad_norm=3.636  lr=4.278e-06
```

Loss decreased monotonically from 5.42 to 4.32 under cosine decay — the tokenizer→packing→collator→model→loss→optimizer pipeline is end-to-end functional.

### Test Results

**19 tests passing**, covering: tokenization of user sequences (variable lengths, BOS/EOS presence), packing (fixed blocks, concatenation, remainder dropping, error on insufficient data), full dataset preparation, DataCollator behavior (label creation, shapes, all-ones attention mask for packed data), integration forward pass with backward, a 24-step Trainer smoke test, and validation that a missing pad_token raises correctly.

---

## Phase 2D: Fine-tuning Pipeline (Weeks 7–9)

### What Was Built

A supervised fine-tuning pipeline for the `JointFusionModel` — the nuFormer-style architecture that combines a pre-trained transaction Transformer with DCNv2(PLR) tabular features for downstream prediction tasks.

| Component | Purpose |
|-----------|---------|
| `DomainFinetuneDataset` | Per-user torch Dataset yielding `{input_ids, attention_mask, tabular_features, labels}` |
| `prepare_finetune_dataset()` | Convenience constructor with validation and logging |
| `finetune_domain_model()` | Fine-tunes `JointFusionModel` via HF Trainer — zero subclassing needed |

### Key Technical Decisions

1. **HF Trainer Pattern A — zero custom code required.** The critical discovery: HuggingFace Trainer inspects `JointFusionModel.forward(self, input_ids, attention_mask, tabular_features, labels)` via `inspect.signature()`. Because `tabular_features` is a named parameter in the forward signature, the Trainer automatically keeps that column in the dataset and passes it to the model. No `compute_loss` override, no `remove_unused_columns=False`, no Trainer subclass. This was verified empirically on transformers 5.7.0 — the Trainer's `_set_signature_columns_if_needed()` method builds the allowed column list directly from the model's `forward()` parameters, and this works identically for a plain `nn.Module` and a `PreTrainedModel`. (A sketch follows this list.)

2. **Per-user padding, not packing.** Unlike pre-training (which packs sequences for 100% token utilization), fine-tuning uses per-user padded sequences. The reason: each training sample needs its own label. In pre-training, the "label" is the next token, shared across the packed block. In fine-tuning, the label is a user-level outcome (e.g., "will this user activate a product?") — each user is a separate sample with its own label. Padding tokens are masked via `attention_mask`, so they don't affect the user embedding extracted by `get_user_embedding()`.

3. **Dataset returns tensors directly; no custom collator.** `DomainFinetuneDataset.__getitem__()` returns pre-tokenized, pre-padded torch tensors. The default PyTorch `DataLoader` collation (stacking tensors into batches) is sufficient. No `DataCollatorForLanguageModeling` is needed — that is pre-training only. This simplifies the pipeline and avoids double-padding issues.

4. **`save_strategy` is configurable (not hardcoded).** During testing we discovered that saving `JointFusionModel` checkpoints via safetensors fails because the wrapped `DomainTransformerForCausalLM` has tied weights (lm_head ↔ embed_tokens), and safetensors rejects shared tensor storage by default. The fix: `save_strategy` is exposed as a parameter, so users can set `"no"` during experimentation or use custom saving logic for production. This is a known HF issue with wrapper models containing tied-weight sub-models.

5. **Binary and multiclass via the `n_classes` parameter.** The same `JointFusionModel` and `finetune_domain_model()` handle both binary classification (`n_classes=1`, BCE loss) and multiclass (`n_classes>1`, CE loss). The loss function switches automatically based on `n_classes`. Labels are `float` for binary and `long` for multiclass — the dataset returns `float32` by default, and the caller casts to `long` for multiclass. (A short loss-switching sketch appears at the end of this phase.)
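The sketch below illustrates decision 1 (signature-based column filtering) and the per-user sample layout from decisions 2 and 3. `ToyFusion` is a hypothetical stand-in whose forward signature mirrors `JointFusionModel`; the tensor sizes echo the 512-token, 291-feature example used later in this report.

```python
import inspect
import torch
import torch.nn as nn

class ToyFusion(nn.Module):
    """Hypothetical stand-in whose forward signature mirrors JointFusionModel's."""
    def forward(self, input_ids, attention_mask, tabular_features, labels=None):
        raise NotImplementedError  # the signature is all that matters for this sketch

# Trainer._set_signature_columns_if_needed() effectively does this:
model = ToyFusion()
signature_columns = list(inspect.signature(model.forward).parameters)
print(signature_columns)
# ['input_ids', 'attention_mask', 'tabular_features', 'labels']
# With remove_unused_columns=True (the default), any dataset column not in this list
# is dropped before batching; tabular_features survives because it is named here.

# Decisions 2-3: one padded sample per user; default DataLoader collation just stacks these.
sample = {
    "input_ids":        torch.zeros(512, dtype=torch.long),      # padded to max_length
    "attention_mask":   torch.zeros(512, dtype=torch.long),      # 1 = real token, 0 = pad
    "tabular_features": torch.zeros(291, dtype=torch.float32),   # hand-crafted tabular features
    "labels":           torch.tensor(1.0),                       # user-level binary outcome
}
```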
### Smoke Test Results

A 5-step fine-tuning run on CPU with a tiny model confirmed the full pipeline:

```
Step 1: loss=0.750  grad_norm=7.158  lr=1.000e-03
Step 3: loss=0.996  grad_norm=3.771  lr=6.545e-04
Step 5: loss=0.818  grad_norm=2.681  lr=9.549e-05
Train loss: 0.752 (5 steps, 20 samples, batch=4)
```

Both the Transformer branch and the PLR+DCNv2 tabular branch received gradients — end-to-end joint training is functional.

### Test Results

**15 tests passing**, covering: dataset creation (length, keys, shapes, padding correctness, attention mask alignment, dtypes, length-mismatch error, stats), DataLoader batching, forward pass on real dataset batches, backward gradient flow through both branches, multiclass classification, a 5-step HF Trainer smoke test, and the `prepare_finetune_dataset` convenience function.
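To close out Phase 2D, here is a minimal sketch of the `n_classes` loss switch from decision 5. It is illustrative rather than the library's actual head code; the only assumption is the convention stated above (float labels with BCE for `n_classes=1`, long labels with CE otherwise).

```python
import torch
import torch.nn.functional as F

def classification_loss(logits, labels, n_classes):
    """BCE-with-logits for binary heads, cross-entropy for multiclass heads."""
    if n_classes == 1:
        # logits: (B, 1), labels: float in {0.0, 1.0}
        return F.binary_cross_entropy_with_logits(logits.squeeze(-1), labels.float())
    # logits: (B, n_classes), labels: long class indices
    return F.cross_entropy(logits, labels.long())

print(classification_loss(torch.randn(4, 1), torch.tensor([0.0, 1.0, 1.0, 0.0]), n_classes=1))
print(classification_loss(torch.randn(4, 3), torch.tensor([0, 2, 1, 2]), n_classes=3))
```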
---

## Cumulative Test Summary

| Phase | Tests | Coverage |
|-------|-------|----------|
| 2A: Tokenizers | 72 | Schema validation, 5 field tokenizers (edge cases: NaN, None, overflow, UNK), predefined schemas, builder pipeline, end-to-end encoding |
| 2B: Models | 33 | Config presets, forward pass shapes, loss computation, weight tying, user embeddings, param counts, gradient checkpointing, causal masking, PLR, DCNv2, JointFusion, tokenizer→model integration |
| 2C: Pre-training | 19 | Tokenization, packing, collation, DataCollator behavior, forward+backward integration, 24-step Trainer smoke test, error handling |
| 2D: Fine-tuning | 15 | Dataset creation/validation, batching, forward/backward through JointFusion, 5-step Trainer smoke test, multiclass, convenience function |
| **Total** | **139** | **All passing** |

---

## Library API Summary (v0.4.0)

```python
from domain_tokenizer import (
    # Schemas
    DomainSchema, FieldSpec, FieldType,
    # Tokenizers
    DomainTokenizerBuilder,
    # Models
    DomainTransformerConfig, DomainTransformerForCausalLM,
    PeriodicLinearReLU, JointFusionModel, DCNv2,
    # Pre-training
    prepare_clm_dataset, pretrain_domain_model,
    # Fine-tuning
    DomainFinetuneDataset, prepare_finetune_dataset, finetune_domain_model,
)
from domain_tokenizer.schemas import FINANCE_SCHEMA, ECOMMERCE_SCHEMA, HEALTHCARE_SCHEMA
```

### End-to-End Usage: Pre-training → Fine-tuning

```python
# 1. Build tokenizer from schema
builder = DomainTokenizerBuilder(FINANCE_SCHEMA)
builder.fit(all_events)
hf_tokenizer = builder.build(text_corpus=descriptions, bpe_vocab_size=8000)

# 2. Prepare packed training data
dataset = prepare_clm_dataset(user_sequences, builder, hf_tokenizer, block_size=512)

# 3. Create and pre-train model
config = DomainTransformerConfig.from_preset("24m", vocab_size=hf_tokenizer.vocab_size)
model = DomainTransformerForCausalLM(config)
pretrain_domain_model(
    model,
    hf_tokenizer,
    dataset,
    hub_model_id="org/finance-24m",
    num_epochs=10,
    learning_rate=3e-4,
    bf16=True,
)

# 4. Create joint fusion model for fine-tuning
fusion = JointFusionModel(
    transformer_model=model,     # pre-trained, unfrozen
    n_tabular_features=291,      # hand-crafted tabular features
    n_classes=1,                 # binary: will the user activate a product?
)

# 5. Prepare fine-tuning data
ft_dataset = prepare_finetune_dataset(
    user_sequences, tabular_features, labels,
    builder, hf_tokenizer, max_length=512,
)

# 6. Fine-tune
finetune_domain_model(
    fusion,
    ft_dataset,
    num_epochs=5,
    learning_rate=1e-4,
    bf16=True,
)
```