Why LFM2.5 for Your Transaction Foundation Model

If you are putting a transaction foundation model into production, especially across more than one business unit, customer, or downstream task, the architectural choice determines per-customer training cost, time-to-first-production-task, and the marginal cost of adding the second and third tasks. The encoder-on-pretrained-backbone architecture takes a recipe Liquid AI already ships in LFM2.5-Audio and LFM2.5-VL and applies it to discrete-feature payment sequences. Three properties support that choice, each backed by a different kind of evidence.

1. The Recipe Is Already Shipping for Two Other Modalities

A small per-modality encoder produces continuous embeddings; a projection adapter (when needed) maps them into the LFM2.5 text backbone's hidden space; LoRA adapts the attention layers per customer. LFM2.5-Audio ingests waveforms this way. LFM2.5-VL ingests vision patches this way. This demo applies the same shape to discrete transaction tokens.

LFM2.5-AUDIO

Audio encoder → projection → LFM2.5 backbone. Ships in production.

LFM2.5-VL

Vision encoder → projection → LFM2.5 backbone. Ships at multiple sizes.

TRANSACTIONS (THIS DEMO)

Structured encoder → frozen LFM2.5-350M + LoRA → multi-head outputs.

2. The Backbone Serves at Production Latency

LFM2.5's conv-dominant layer stack gives O(N) prefill scaling on most layers where a pure-attention model pays O(N²). Published hardware-in-the-loop benchmarks from the LFM2 technical report, S25 / 4K context:

Model	Prefill (tok/s)	Decode (tok/s)
LFM2-2.6B	116	30.0
Qwen3-4B	35	11.4
Llama-3.2-3B	51	15.8

Your serving path is the published LFM2.5 backbone unchanged; only the input side differs from a text deployment, so the published latency advantage should carry over. Source: LFM2 Technical Report, arXiv 2511.23404.

3. Frozen Backbone + LoRA Is the Higher-Quality Configuration at Typical Label Budgets

Freezing the LFM2.5 backbone and adapting it with LoRA produces higher quality than unfreezing the full backbone end-to-end at typical finserv label budgets. LoRA’s low-rank update structure acts as effective regularization; lifting that constraint lets the backbone memorize the training labels rather than generalize. The frozen-backbone commitment is not a quality compromise — it is the higher-quality operating point.

Configuration	Trainable	Fraud ROC-AUC	MCC top-1
Frozen backbone + LoRA (this demo)	~16M	0.951	40.5%
Full backbone unfreeze	~370M	0.900	38.1%

Measured on 200K synthetic sequences (64 transactions × 15 features each). At ~16M trainable parameters (encoder + LoRA + heads), per-customer adaptation is small relative to the deployed footprint and completes in hours, not days.

One Backbone, Many Heads

A single forward pass through the backbone produces hidden states that four task heads pool independently — fraud detection, next-merchant prediction, amount-bucket forecasting, MCC classification. New use-cases (disputes, authorization optimization, AML) add a head, not a foundation model.

Per-head MLP: ~0.5M params. Add a new task in hours.
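
As a minimal sketch of how several heads can pool the same hidden states independently, the snippet below computes the flattened positions of the last and pre-last transaction stripes, assuming the 64-transaction × 15-feature layout this document uses for the demo. The names `stripe` and `mean_pool` are illustrative and not part of any Liquid API; a real implementation would slice a tensor along its sequence axis rather than a Python list.

```python
def stripe(t_tx: int, n_feat: int, which: str) -> slice:
    """Flattened positions covered by one transaction's features."""
    back = {"last": 1, "pre_last": 2}[which]  # how many transactions from the end
    start = (t_tx - back) * n_feat
    return slice(start, start + n_feat)


def mean_pool(hidden_row, positions: slice) -> float:
    """Average a 1-D sequence of hidden values over the given positions."""
    values = hidden_row[positions]
    return sum(values) / len(values)


# Toy hidden states: each position simply stores its transaction index (0..63).
T_TX, N_FEAT = 64, 15
hidden_row = [float(pos // N_FEAT) for pos in range(T_TX * N_FEAT)]

# Fraud-style heads read the last transaction (positions 945..959).
fraud_input = mean_pool(hidden_row, stripe(T_TX, N_FEAT, "last"))
# Next-transaction heads read the one before it (positions 930..944).
next_tx_input = mean_pool(hidden_row, stripe(T_TX, N_FEAT, "pre_last"))
print(fraud_input, next_tx_input)  # 63.0 62.0
```

Both heads consume the output of the same forward pass; only the slice they average differs, which is why adding a task does not require touching the backbone.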

One Backbone, Many Customers

The pretrained LFM2.5 weights ship once. Per-customer training is the encoder + LoRA + heads — under 5% of base size in bf16 slim format. Adding a customer is loading new artifacts on top of the cached backbone, not retraining from scratch.

Slim per-customer artifact: ~30 MB bf16 at LFM2.5-1.2B scale.

The Architecture Matches Transaction Data Structure

Transaction data is information-dense locally (within-transaction feature correlations, adjacent-transaction continuity) with sparse long-range signal (behavioral baselines across the full history). LFM2.5 allocates O(N) conv to the dense local patterns and O(N²) attention to the sparse global ones. A pure transformer would spend O(N²) compute uniformly.

Within Transaction

Merchant determines MCC. Entry mode correlates with amount. Dense, local, often deterministic. A 3-wide conv kernel captures this.

Adjacent Transactions

Strong temporal continuity. A customer at Starbucks at 8am is likely at a similar merchant tomorrow. Local conv handles it.

Distant Transactions

Weak but non-zero signal. Behavioral profile matters for fraud baseline. This is where attention earns its quadratic cost.
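
A rough back-of-the-envelope comparison makes the cost split concrete. This sketch counts token-to-token interactions per layer at the demo's sequence length and deliberately ignores hidden-size, head-count, and masking constants; the function names are illustrative, and the 3-wide kernel is the one mentioned above.

```python
def conv_interactions(n_tokens: int, kernel_width: int = 3) -> int:
    """Each position mixes with a fixed window of neighbours: O(N)."""
    return n_tokens * kernel_width


def attention_interactions(n_tokens: int) -> int:
    """Each position can attend to every position (no mask assumed): O(N^2)."""
    return n_tokens * n_tokens


n = 64 * 15  # 64 transactions x 15 features = 960 tokens
print(conv_interactions(n))                                # 2880
print(attention_interactions(n))                           # 921600
print(attention_interactions(n) // conv_interactions(n))   # 320
```

The ratio grows linearly with sequence length, so the gap widens as transaction histories get longer.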

Your Data, Your Model, Your Infrastructure

LFM2.5 base weights are open. Liquid licenses the architecture, training recipe, and engineering support. Customers train on their proprietary data behind their firewall. No data leaves customer infrastructure. No dependency on external model APIs. The result is a foundation model the customer owns, adapted to their transaction distribution.

Scope of Claim

What this demo validates

The encoder-on-pretrained-backbone architecture used by LFM2.5-Audio and LFM2.5-VL applies to discrete-feature transaction sequences without modifying the transformers library.
Per-customer training touches ~2–5% of the deployed footprint and trains in hours rather than days.
On synthetic data, frozen-backbone-plus-LoRA outperforms full-backbone unfreezing on every measured head.
One pretrained backbone serves all task heads and is identical across customer deployments.

What a POC on your data would establish

Whether synthetic-data quality numbers reproduce on your transaction distribution.
Production-scale quality at LFM2.5-1.2B on your hardware and sequence lengths (this demo runs at LFM2.5-350M).
Inference latency against your authorization-decision budget at your concurrency.
Cross-customer or cross-business-unit transfer of the encoder and LoRA artifacts.

Architecture: arXiv 2511.23404 · Weights: huggingface.co/LiquidAI

Integration Architecture

How a customer team builds this stack end to end. Six components: three ship from Liquid (LFM2.5 base weights, training recipes, architecture support) and three are customer-bespoke (schema, encoder, task heads). Per-customer adaptation takes one ML engineer a few weeks; it is not a research project.

Preprocess → Encode → Backbone + LoRA → Heads → Postprocess → Deploy

1. Schema & Preprocessing

Define the discrete feature schema first: features, vocab sizes, ordering. Categorical features (merchant_id, MCC, country) map directly to integer IDs. Continuous features (amount, days-since-last) get quantile-bucketed into N bins. High-cardinality features (10K+ merchants) have their long tail bucketed or factored. Reserve 3 token IDs per feature for MASK / OOV / NULL. Final per-customer batch shape: (B, T_tx, F).

Sequence: 64 tx × 15 feat = 960 tokens
amount → 16 quantile bins
merchant_id top-10K + frequency bucketing for the tail
unseen values at inference → OOV (ID 1)

2. Structured Encoder

One embedding table per feature (sized to its vocab) plus a shared feature-type table. Value + type embeddings are summed to identify which feature each token represents. The 15 per-tx feature embeddings are kept as separate positions in the sequence. Compressing them to one token per transaction collapses fraud quality (fraud ROC-AUC drops to 0.535 on this demo's data), because the per-tx MLP averages away the intra-tx feature combinations fraud depends on. This is the same shape as the audio and vision encoders' input embedding step.

Output shape: (B, T_tx*F, d_lfm) = (B, 960, 1024) at LFM2.5-350M
value_tables[f](token) + type_table(f)
Encoder params dominated by high-cardinality value tables (~14M at 350M)

3. Projection Adapter (When Needed)

When the encoder's output dimension matches d_lfm directly, no adapter is needed: the encoder outputs flow straight into the backbone. When d_encoder < d_lfm (typical at LFM2.5-1.2B where d_lfm=2048), a single linear projection lifts the encoder output into the backbone hidden space, exactly mirroring the audio/VL projection adapter. Layer init: identity for d_encoder=d_lfm, Xavier for the projection case.

350M: d_lfm=1024, d_encoder=1024, no adapter (identity)
1.2B: d_lfm=2048, project from d_encoder=512-1024 → 2048
Adds ~2M params at 1.2B scale

4. Backbone + LoRA

Load the pretrained LFM2.5 base from Hugging Face. The backbone’s parameters are excluded from the optimizer’s parameter set during training: gradients flow through the backbone to update the upstream encoder and downstream heads, but the backbone’s own weights are never modified. The forward pass executes through all 354M backbone parameters at full capacity, at both training and inference time. Customer-distribution adaptation enters through (i) LoRA’s low-rank delta on the attention projections (q_proj / k_proj / v_proj / out_proj) and (ii) the per-feature encoder, both trained from scratch on customer labels. Encoder outputs are injected via the published inputs_embeds hook in Lfm2Model.forward. Adding LoRA to the conv layers does not improve quality enough to justify the ~50% increase in training cost; attention-only LoRA is the recommended starting configuration.

Backbone params excluded from optimizer; backbone forward at full capacity
LoRA r=16, α=32, dropout 0.05 on q_proj / k_proj / v_proj / out_proj
PEFT wraps the leaf modules
~1M LoRA params at 350M, ~2M at 1.2B

5. Task Heads

Per-task downstream heads pool backbone hidden states and predict via small MLPs. Fraud (BCE loss) pools the last-transaction stripe: the mean of positions T-F..T-1 (positions 945..959). Categorical heads (next-merchant, amount-bucket, MCC) use cross-entropy and pool the pre-last transaction stripe (positions 930..944) to avoid leaking the prediction target. New tasks add a head; the backbone is untouched.

Per-head MLP: 128 hidden, dropout 0.1
Pool: last_tx_mean for sequence tasks
Pool: pre_last_tx_mean for next-tx tasks
~0.5M params per head

6. Postprocessing

Fraud logits → sigmoid → probability in [0, 1]; calibrate against the customer's operational threshold (typical: 70% precision @ 60% recall for review-queue handoff). Categorical logits → softmax → top-k distribution. Use the predicted distribution for downstream decisioning, not just argmax: the runner-up matters when the top-1 is uncertain. Behavioral attribution: gradient-based saliency on the per-feature embeddings identifies which input features drove the score.

Fraud: sigmoid(logits) → operational threshold
Categorical: softmax(logits) → top-k + calibration
Saliency: ∂loss/∂value_embed identifies driving features
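
The Schema & Preprocessing and Structured Encoder steps above can be sketched in a few lines. This is an illustrative NumPy version under stated assumptions, not Liquid's implementation: the helper names (`bucket_id`, `category_id`, `StructuredEncoder`) are hypothetical, the MASK = 0 and NULL = 2 assignments are assumed (the text only fixes OOV = 1), the bin edges are taken as given rather than fitted, and the embedding width is shrunk from 1024 to 32 to keep the example light.

```python
import bisect
import numpy as np

# Reserved IDs per feature. The text fixes OOV = 1; MASK = 0 and NULL = 2 are assumptions.
MASK_ID, OOV_ID, NULL_ID = 0, 1, 2
N_RESERVED = 3


def bucket_id(value, edges):
    """Continuous value -> bin token ID, placed after the reserved IDs."""
    if value is None:
        return NULL_ID
    return N_RESERVED + bisect.bisect_right(edges, value)


def category_id(value, vocab):
    """Categorical value -> token ID; values unseen at fit time map to OOV."""
    return vocab.get(value, OOV_ID)


class StructuredEncoder:
    """One value table per feature plus a shared feature-type table (random init)."""

    def __init__(self, vocab_sizes, d_model, seed=0):
        rng = np.random.default_rng(seed)
        self.value_tables = [rng.normal(0.0, 0.02, (v, d_model)) for v in vocab_sizes]
        self.type_table = rng.normal(0.0, 0.02, (len(vocab_sizes), d_model))

    def __call__(self, tokens):
        """tokens: int array (B, T_tx, F) -> embeddings (B, T_tx * F, d_model)."""
        b, t_tx, f = tokens.shape
        per_feature = [
            self.value_tables[j][tokens[:, :, j]] + self.type_table[j]
            for j in range(f)
        ]
        stacked = np.stack(per_feature, axis=2)  # (B, T_tx, F, d_model)
        # Flatten so each transaction occupies F consecutive positions.
        return stacked.reshape(b, t_tx * f, -1)


# Toy usage: 15 features, 64 transactions, d_model shrunk to 32.
vocab_sizes = [N_RESERVED + 16] + [50] * 14  # feature 0: amount in 16 bins
encoder = StructuredEncoder(vocab_sizes, d_model=32)
tokens = np.random.default_rng(1).integers(0, 19, size=(2, 64, 15))
embeddings = encoder(tokens)
print(embeddings.shape)  # (2, 960, 32)
```

Because features are never merged within a transaction, position `t * 15 + j` always holds feature `j` of transaction `t`, which is the layout the stripe pooling in the Task Heads step relies on.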

Training Recipe

Single-stage supervised fine-tune on the customer’s labelled data, with no separate pretraining stage. Three trainable parameter groups (LoRA delta, per-feature encoder, task heads), each with its own learning-rate setting, because each group differs in initialization, parameter scale, and gradient-norm profile.

LORA GROUP

lr = 1e-3 · ~1M params

Low-rank adapters on the backbone’s attention projections. Initialized so the LoRA path contributes zero at step 0, then steers attention behavior toward the customer’s distribution. Higher LR than the encoder group is fine — the low-rank constraint regularizes the update by construction.

ENCODER GROUP

lr = 3e-4 · ~14M params

Per-feature value tables + feature-type table, from random init on the customer’s tokenized vocabulary. Lower LR than LoRA because random-init embedding matrices destabilize at higher rates; high-cardinality tables (10K-vocab merchant) dominate gradient norm if not damped.

HEADS GROUP

lr = 1e-3 · ~2M params

Per-task MLPs (fraud, next-merchant, amount-bucket, MCC), from random init. Higher LR is fine — small per-head parameter count, well-conditioned loss surface. New downstream tasks attach as additional heads without retraining the backbone or the encoder.
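
One way to wire up the three groups is to route parameters by name, so that backbone weights never reach the optimizer. The sketch below is a hedged illustration: the `encoder.` and `heads.` prefixes and the example parameter names are hypothetical, while the `lora_` substring relies on how PEFT names its LoRA matrices (`lora_A`, `lora_B`). Strings stand in for real parameter tensors.

```python
def build_param_groups(named_params):
    """Split (name, parameter) pairs into the three trainable groups."""
    groups = {"lora": [], "encoder": [], "heads": []}
    for name, param in named_params:
        if "lora_" in name:
            groups["lora"].append(param)
        elif name.startswith("encoder."):
            groups["encoder"].append(param)
        elif name.startswith("heads."):
            groups["heads"].append(param)
        # Anything else is treated as a frozen backbone weight and left out.
    return [
        {"params": groups["lora"], "lr": 1e-3},
        {"params": groups["encoder"], "lr": 3e-4},
        {"params": groups["heads"], "lr": 1e-3},
    ]


# Toy usage with hypothetical parameter names.
named = [
    ("backbone.layers.0.conv.weight", "w0"),
    ("backbone.layers.2.self_attn.q_proj.lora_A.default.weight", "w1"),
    ("encoder.value_tables.0.weight", "w2"),
    ("heads.fraud.0.weight", "w3"),
]
param_groups = build_param_groups(named)
print([(len(g["params"]), g["lr"]) for g in param_groups])
# [(1, 0.001), (1, 0.0003), (1, 0.001)]
```

The returned list uses the `params` / `lr` per-group layout that `torch.optim.AdamW` accepts as its first argument, so the same routing would apply to the output of `model.named_parameters()`.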

Optimizer

AdamW, β = (0.9, 0.95), weight decay 0.1

Schedule

200-step linear warmup, cosine decay to 10% of peak over ~5K steps

Precision

bf16 forward and backward, fp32 loss accumulation

Multi-task

fraud 1.0, categorical heads 0.5 each — chosen to match per-task gradient norm in the first 200 warmup steps

Compute

~2 hours end-to-end on a single A100 at LFM2.5-350M scale
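
The schedule and multi-task weighting listed above can be written as two small functions. This is a sketch under assumptions: the text does not say whether the ~5K steps include warmup, so `total_steps=5000` is assumed to cover both phases, and the task keys in `LOSS_WEIGHTS` are illustrative names for the four heads.

```python
import math

LOSS_WEIGHTS = {"fraud": 1.0, "next_merchant": 0.5, "amount_bucket": 0.5, "mcc": 0.5}


def lr_scale(step, warmup_steps=200, total_steps=5000, floor=0.1):
    """Multiplier on a group's peak LR: linear warmup, then cosine decay to `floor`."""
    if step < warmup_steps:
        return step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return floor + (1.0 - floor) * 0.5 * (1.0 + math.cos(math.pi * progress))


def total_loss(per_task_losses):
    """Weighted sum of per-task losses using the weights above."""
    return sum(LOSS_WEIGHTS[task] * loss for task, loss in per_task_losses.items())


# The multiplier rises to 1.0 at the end of warmup and ends at the 10% floor.
for step in (0, 100, 200, 5000):
    print(step, round(lr_scale(step), 4))

example = {"fraud": 0.2, "next_merchant": 4.0, "amount_bucket": 2.0, "mcc": 3.0}
print(round(total_loss(example), 4))
```

Each parameter group multiplies its own peak learning rate by `lr_scale(step)`, so the groups keep their relative ratios throughout training.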

Deploy: Per-Customer Adapter on a Shared Backbone

The deployable per-customer artifact is the trained LoRA delta + per-feature encoder + task heads. The LFM2.5 base is not included; it is loaded once per serving GPU from the public weights. At LFM2.5-350M the artifact is ~190 MB in bf16; the unstripped version including the base would be ~900 MB. Multi-tenant serving keeps one backbone resident on the GPU and switches the active LoRA delta + encoder per request, so a new customer adds a small adapter rather than a second foundation model. The conv-dominant backbone quantizes cleanly to INT8 for cost-sensitive deployments. The same code path runs on CPU or GPU.

Per-customer artifact: ~190 MB bf16 at LFM2.5-350M (LoRA + encoder + heads)
Multi-tenant: shared backbone forward + per-request LoRA switching
CPU (~5s/inference) or H100 (<100ms), same code
INT8 quantization clean for conv layers
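
The shared-backbone serving pattern can be illustrated with a small router that holds one backbone and lazily caches per-customer adapter bundles. This is a structural sketch only: `AdapterRouter` and `fake_loader` are hypothetical names, strings stand in for real model objects, and the actual mechanics of activating a LoRA delta on the backbone are left out.

```python
from typing import Any, Callable, Dict, Tuple


class AdapterRouter:
    """Keeps one shared backbone and lazily caches per-customer adapter bundles."""

    def __init__(self, backbone: Any, load_adapter: Callable[[str], Any]):
        self.backbone = backbone            # loaded once, shared by every customer
        self._load_adapter = load_adapter   # e.g. reads LoRA + encoder + heads from disk
        self._cache: Dict[str, Any] = {}

    def adapter_for(self, customer_id: str) -> Any:
        if customer_id not in self._cache:
            self._cache[customer_id] = self._load_adapter(customer_id)
        return self._cache[customer_id]

    def route(self, customer_id: str) -> Tuple[Any, Any]:
        """Return the (shared backbone, customer adapter) pair for one request."""
        return self.backbone, self.adapter_for(customer_id)


# Toy usage: record which customers trigger an adapter load.
loads = []


def fake_loader(customer_id: str) -> str:
    loads.append(customer_id)
    return f"adapter[{customer_id}]"


router = AdapterRouter(backbone="lfm2.5-backbone", load_adapter=fake_loader)
for cid in ("bank_a", "bank_b", "bank_a"):
    backbone, adapter = router.route(cid)
print(loads)  # ['bank_a', 'bank_b']
```

Onboarding a customer in this arrangement means registering one more adapter bundle; the backbone object is never duplicated.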

Configuration Choices That Look Right and Aren’t

1. Do not compress each transaction to one token: an MLP that averages the 15 feature embeddings into a single vector destroys the intra-tx fraud signal. Fraud ROC-AUC collapsed to 0.535. Keep the full T_tx*F stripe.
2. Pool the pre-last transaction, not the last, for next-tx heads: pooling the last-tx stripe for next-merchant prediction leaks the prediction target into the input. Use the prior tx.
3. Don't unfreeze the backbone at typical label budgets: full-backbone unfreezing produces lower quality than frozen-plus-LoRA (fraud ROC-AUC 0.951 → 0.900 on this demo's data). LoRA acts as effective regularization; lifting it forces overfitting.
4. Tied embedding heads need SSL-pretrained value tables: tying the next-merchant head to the encoder's merchant value table without self-supervised pretraining of those tables reduces next-merchant top-1 from 7.78% to 3.74%. Use a fresh MLP head until SSL pretraining anchors the value tables.
5. Average per-feature losses; do not sum: summing CE losses across features makes high-cardinality features (10K-vocab merchant) dominate the gradient. The low-cardinality features stop training.
6. Match the schema fingerprint between training and inference: if the tokenizer's vocab changes between training and deployment, the encoder's value tables index into a different semantic space. Embed the fingerprint in checkpoint metadata.
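
The schema-fingerprint check in the last item could look something like the following. It is a minimal sketch, assuming the schema is representable as JSON and that a SHA-256 over a canonical serialization is an acceptable fingerprint; the function names and the `schema_fingerprint` metadata key are hypothetical. For brevity it hashes only feature names, order, and vocab sizes; a production version would presumably also hash the value-to-ID mappings themselves.

```python
import hashlib
import json


def schema_fingerprint(schema: dict) -> str:
    """Stable hash of the schema; list order (feature ordering) is preserved."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assert_schema_matches(checkpoint_meta: dict, serving_schema: dict) -> None:
    """Refuse to serve if the live schema differs from the one used in training."""
    expected = checkpoint_meta["schema_fingerprint"]
    actual = schema_fingerprint(serving_schema)
    if actual != expected:
        raise ValueError(f"schema fingerprint mismatch: {actual[:12]} != {expected[:12]}")


train_schema = {"features": [
    {"name": "merchant_id", "vocab_size": 10003},
    {"name": "amount", "vocab_size": 19},
]}
meta = {"schema_fingerprint": schema_fingerprint(train_schema)}

assert_schema_matches(meta, train_schema)  # same schema: passes silently

drifted = {"features": [
    {"name": "merchant_id", "vocab_size": 12003},  # vocabulary grew after training
    {"name": "amount", "vocab_size": 19},
]}
try:
    assert_schema_matches(meta, drifted)
except ValueError as err:
    print("rejected:", err)
```

Failing loudly at load time is the point: a silent mismatch would still produce scores, just from embeddings that no longer mean what the encoder learned.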

Typical Engagement

Phase	Duration	What Happens
Discovery	1-2 weeks	Schema design, data sample (~100K-1M sequences), compliance review, architectural fit assessment.
POC	1 week	Fine-tune encoder + LoRA + heads on customer sample, measurement report, go/no-go recommendation.
Production	2-3 months	Customer engineering team builds with Liquid architectural support, weekly design review, scale-up to LFM2.5-1.2B.
Scale	Ongoing	Multi-task expansion (add heads), multi-tenant serving, retraining cadence, architecture evolution.
