"""Encoder-specific render module.
Reuses parent's format functions (fraud bar, top-k predictions, timeline)
directly via import — those are model-agnostic. The encoder demo only
adds two pieces of new content:
1. `render_why_encoder()` — the architectural pitch for the encoder pattern
2. `render_encoder_integration()` — build-it-yourself integration guide
The fraud / merchant / amount / MCC / timeline / top-k formatters are
unchanged from parent. We import them so the encoder demo's prediction
cards are visually consistent.
"""
from __future__ import annotations
# Re-export parent's format functions so encoder app.py can do a single
# import from this module.
from src.demo.render import ( # noqa: F401
format_amount_predictions,
format_fraud_score,
format_mcc_predictions,
format_merchant_predictions,
format_timeline,
format_topk_predictions,
)
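# --- Illustrative sketch, not called by the render functions ----------------
# The guide rendered by render_encoder_integration() (further down in this
# module) describes a flattened T_tx * F token layout, three reserved IDs per
# feature, and a schema fingerprint stored with the checkpoint. The helpers
# below show one way those pieces could fit together. Every SKETCH_* / sketch_*
# name is hypothetical, and the MASK=0 / NULL=2 assignment is an assumption:
# the guide only fixes OOV at ID 1.
import hashlib
import json
from typing import Dict, Optional

SKETCH_T_TX = 64  # transactions per sequence, per the guide
SKETCH_F = 15  # features per transaction, per the guide
SKETCH_MASK_ID, SKETCH_OOV_ID, SKETCH_NULL_ID = 0, 1, 2  # MASK/NULL IDs assumed


def sketch_token_id(vocab: Dict[str, int], value: Optional[str]) -> int:
    """Map a raw categorical value to a token ID with NULL and OOV fallbacks."""
    if value is None:
        return SKETCH_NULL_ID
    # Real vocab entries are assumed to start after the three reserved IDs.
    return vocab.get(value, SKETCH_OOV_ID)


def sketch_stripe(tx_index: int, num_features: int = SKETCH_F) -> range:
    """Flat sequence positions occupied by one transaction's feature tokens."""
    start = tx_index * num_features
    return range(start, start + num_features)


def sketch_schema_fingerprint(schema: Dict[str, int]) -> str:
    """Hash feature names, their order, and vocab sizes into a hex digest."""
    # A list of pairs keeps feature order significant in the canonical form.
    canonical = json.dumps(list(schema.items()), separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# With the defaults above, sketch_stripe(SKETCH_T_TX - 1) covers positions
# 945..959 (last transaction) and sketch_stripe(SKETCH_T_TX - 2) covers
# 930..944 (pre-last), matching the pooling ranges quoted in the Task Heads
# card. Comparing sketch_schema_fingerprint() at load time against the value
# saved at training time is one way to catch the vocab drift that gotcha 6
# warns about.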
# Liquid design tokens
_TEXT = "#171717"
_TEXT_MUTED = "#525252"
_TEXT_DIM = "#737373"
_BG_CARD = "#ffffff"
_BG_CARD_ALT = "#fafafa"
_BORDER = "rgba(0,0,0,0.1)"
_BORDER_SUBTLE = "rgba(0,0,0,0.05)"
_ACCENT_GREEN = "#10B981"
_ACCENT_BLUE = "#3B82F6"
_ACCENT_AMBER = "#F59E0B"
_ACCENT_PURPLE = "#7c3aed"
_RADIUS_CARD = "16px"
_RADIUS_SM = "8px"
_FONT_MONO = "JetBrains Mono, ui-monospace, SFMono-Regular, monospace"
# Single max-width applied to every tab's content so the three surfaces
# read at the same width. Picked to fit the Gradio container (1280px)
# with a small inset on either side.
_CONTAINER_WIDTH = "1180px"
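# --- Illustrative sketch, not called by the render functions ----------------
# The Training Recipe section rendered by render_encoder_integration()
# describes a 200-step linear warmup, cosine decay to 10% of peak over ~5K
# steps, and separate peak learning rates for the LoRA, encoder, and heads
# groups. The helpers below are a minimal sketch of that schedule. All
# SKETCH_* / sketch_* names are hypothetical, and treating "~5K steps" as
# 5000 total steps including warmup is an assumption.
import math

SKETCH_PEAK_LR = {"lora": 1e-3, "encoder": 3e-4, "heads": 1e-3}  # from the recipe
SKETCH_WARMUP_STEPS = 200
SKETCH_TOTAL_STEPS = 5000  # assumed reading of "~5K steps"
SKETCH_FLOOR = 0.10  # decay floor as a fraction of peak


def sketch_lr_scale(step: int) -> float:
    """Multiplier on peak LR: linear warmup, then cosine decay to the floor."""
    if step < SKETCH_WARMUP_STEPS:
        return (step + 1) / SKETCH_WARMUP_STEPS
    span = max(1, SKETCH_TOTAL_STEPS - SKETCH_WARMUP_STEPS)
    # Clamp so steps past the schedule stay at the floor.
    progress = min(1.0, (step - SKETCH_WARMUP_STEPS) / span)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return SKETCH_FLOOR + (1.0 - SKETCH_FLOOR) * cosine


def sketch_group_lrs(step: int) -> dict:
    """Per-group learning rates at a given optimizer step."""
    scale = sketch_lr_scale(step)
    return {name: peak * scale for name, peak in SKETCH_PEAK_LR.items()}


# In a training loop, the values from sketch_group_lrs(step) could be written
# into the matching optimizer parameter groups before each step, so the three
# groups share one schedule shape while keeping their own peak rates.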
def render_why_encoder() -> str:
"""Why Liquid tab content.
Opens with the buyer's problem (multi-customer / multi-task
transaction-FM economics), then the published precedent
(LFM2.5-Audio / LFM2.5-VL), then the architectural and operational
properties, then scope-of-claim. Written for an external audience —
no internal codenames, no design-log register, no "we claim" framing.
"""
def _table_header(cols: list[str]) -> str:
ths = ""
for i, c in enumerate(cols):
align = "right" if i > 0 else "left"
ths += (
f'
Why LFM2.5 for Your Transaction Foundation Model
If you are putting a transaction foundation model into production —
especially across more than one business unit, customer, or downstream task —
the architectural choice determines per-customer training cost,
time-to-first-production-task, and the marginal cost of adding the second
and third tasks. The encoder-on-pretrained-backbone architecture applies
a recipe Liquid AI already ships in LFM2.5-Audio and LFM2.5-VL to
discrete-feature payment sequences. Three properties follow, each backed
by a different kind of evidence.
1. The Recipe Is Already Shipping for Two Other Modalities
A small per-modality encoder produces continuous embeddings; a projection
adapter (when needed) maps them into the LFM2.5 text backbone's hidden space;
LoRA adapts the attention layers per customer. LFM2.5-Audio ingests waveforms
this way. LFM2.5-VL ingests vision patches this way. This demo applies the
same shape to discrete transaction tokens.
LFM2.5-AUDIO
Audio encoder → projection → LFM2.5 backbone. Ships in production.
LFM2.5-VL
Vision encoder → projection → LFM2.5 backbone. Ships at multiple sizes.
TRANSACTIONS (THIS DEMO)
Structured encoder → frozen LFM2.5-350M + LoRA → multi-head outputs.
2. The Backbone Serves at Production Latency
LFM2.5's conv-dominant layer stack gives O(N) prefill scaling on most layers
where a pure-attention model pays O(N²). Published hardware-in-the-loop
benchmarks from the LFM2 technical report, S25 / 4K context:
{_table_header(["Model", "Prefill (tok/s)", "Decode (tok/s)"])}
{_table_row(["LFM2-2.6B", "116", "30.0"], highlight=True)}
{_table_row(["Qwen3-4B", "35", "11.4"])}
{_table_row(["Llama-3.2-3B", "51", "15.8"])}
Your serving path is the published LFM2.5 backbone unchanged — only
the input side differs from a text deployment. The published latency
advantage transfers directly.
Source: LFM2 Technical Report, arXiv 2511.23404.
3. Frozen Backbone + LoRA Is the Higher-Quality Configuration at Typical Label Budgets
Freezing the LFM2.5 backbone and adapting it with LoRA produces higher
quality than unfreezing the full backbone end-to-end at typical finserv
label budgets. LoRA’s low-rank update structure acts as effective
regularization; lifting that constraint lets the backbone memorize the
training labels rather than generalize. The frozen-backbone commitment
is not a quality compromise — it is the higher-quality operating
point.
{_table_header(["Configuration", "Trainable", "Fraud ROC-AUC", "MCC top-1"])}
{_table_row(["Frozen backbone + LoRA (this demo)", "~16M", "0.951", "40.5%"],
highlight=True)}
{_table_row(["Full backbone unfreeze", "~370M", "0.900", "38.1%"])}
Measured on 200K synthetic sequences (64 transactions × 15 features each).
At ~16M trainable parameters (encoder + LoRA + heads), per-customer
adaptation is small relative to the deployed footprint and completes
in hours, not days.
One Backbone, Many Heads
A single forward pass through the backbone produces hidden states that
four task heads pool independently — fraud detection, next-merchant
prediction, amount-bucket forecasting, MCC classification. New
use-cases (disputes, authorization optimization, AML) add a head, not
a foundation model.
Per-head MLP: ~0.5M params. Add a new task in hours.
One Backbone, Many Customers
The pretrained LFM2.5 weights ship once. Per-customer training is the
encoder + LoRA + heads — under 5% of base size in bf16 slim format.
Adding a customer is loading new artifacts on top of the cached
backbone, not retraining from scratch.
Slim per-customer artifact: ~30 MB bf16 at LFM2.5-1.2B scale.
The Architecture Matches Transaction Data Structure
Transaction data is information-dense locally (within-transaction
feature correlations, adjacent-transaction continuity) with sparse
long-range signal (behavioral baselines across the full history).
LFM2.5 allocates O(N) conv to the dense local patterns and O(N²)
attention to the sparse global ones. A pure transformer would spend
O(N²) compute uniformly.
Within Transaction
Merchant determines MCC. Entry mode correlates with amount. Dense,
local, often deterministic. A 3-wide conv kernel captures this.
Adjacent Transactions
Strong temporal continuity. A customer at Starbucks at 8am is
likely at a similar merchant tomorrow. Local conv handles it.
Distant Transactions
Weak but non-zero signal. Behavioral profile matters for fraud
baseline. This is where attention earns its quadratic cost.
Your Data, Your Model, Your Infrastructure
LFM2.5 base weights are open. Liquid licenses the architecture, training
recipe, and engineering support. Customers train on their proprietary
data behind their firewall. No data leaves customer infrastructure.
No dependency on external model APIs. The result is a foundation model
the customer owns, adapted to their transaction distribution.
Scope of Claim
What this demo validates
- The encoder-on-pretrained-backbone architecture used by LFM2.5-Audio
and LFM2.5-VL applies to discrete-feature transaction sequences
without modifying the transformers library.
- Per-customer training touches ~2–5% of the deployed footprint
and trains in hours rather than days.
- On synthetic data, frozen-backbone-plus-LoRA outperforms
full-backbone unfreezing on every measured head.
- One pretrained backbone serves all task heads and is identical
across customer deployments.
What a POC on your data would establish
- Whether synthetic-data quality numbers reproduce on your
transaction distribution.
- Production-scale quality at LFM2.5-1.2B on your hardware and
sequence lengths (this demo runs at LFM2.5-350M).
- Inference latency against your authorization-decision budget
at your concurrency.
- Cross-customer or cross-business-unit transfer of the encoder
and LoRA artifacts.
"""
def render_encoder_integration() -> str:
"""Build-it-yourself integration guide.
Walks the reader through every component a customer team would build
to reproduce this demo on their own data: preprocessing, encoder,
backbone wiring, heads, postprocessing, training, deployment.
Includes hyperparameter cards, gotchas, and an engagement timeline.
"""
def _phase_card(num: str, title: str, body: str, detail: str) -> str:
return f"""
Integration Architecture
How a customer team builds this stack end to end. Of the six components, three
ship from Liquid (LFM2.5 base weights, training recipes, architecture
support) and three are customer-bespoke (schema, encoder, task heads).
Per-customer adaptation takes one ML engineer a few weeks; it is not a
research project.
{_pill("Preprocess")}{arrow}
{_pill("Encode")}{arrow}
{_pill("Backbone + LoRA")}{arrow}
{_pill("Heads")}{arrow}
{_pill("Postprocess")}{arrow}
{_pill("Deploy")}
{_phase_card(
"1",
"Schema & Preprocessing",
"Define the discrete feature schema first — features, vocab sizes, ordering. "
"Categorical features (merchant_id, MCC, country) map directly to integer IDs. "
"Continuous features (amount, days-since-last) get quantile-bucketed into N bins. "
"High-cardinality features (10K+ merchants) have their long tail bucketed or "
"factored. Reserve 3 token IDs per feature for MASK / OOV / NULL. "
"Final per-customer batch shape: (B, T_tx, F).",
"Sequence: 64 tx × 15 feat = 960 tokens | "
"amount → 16 quantile bins | "
"merchant_id top-10K + frequency bucketing for the tail | "
"unseen values at inference → OOV (ID 1)"
)}
{_phase_card(
"2",
"Structured Encoder",
"One embedding table per feature (sized to its vocab) plus a shared "
"feature-type table. Value and type embeddings are summed so each token "
"carries the identity of its feature. The 15 per-tx feature embeddings "
"are kept as separate positions in the sequence; compressing them to "
"one token per transaction collapses fraud quality (fraud ROC-AUC drops "
"to 0.535 on this demo's data), because a per-tx MLP that pools the 15 "
"embeddings averages away the intra-tx feature combinations fraud depends "
"on. This is the same shape "
"as the audio and vision encoders' input embedding step.",
"Output shape: (B, T_tx*F, d_lfm) = (B, 960, 1024) at LFM2.5-350M | "
"value_tables[f](token) + type_table(f) | "
"Encoder params dominated by high-cardinality value tables (~14M at 350M)"
)}
{_phase_card(
"3",
"Projection Adapter (When Needed)",
"When the encoder's output dimension matches d_lfm directly, no adapter "
"is needed — the encoder outputs flow straight into the backbone. When "
"d_encoder < d_lfm (typical at LFM2.5-1.2B where d_lfm=2048), a single "
"linear projection lifts the encoder output into the backbone hidden space, "
"exactly mirroring the audio/VL projection adapter. Layer init: identity "
"for d_encoder=d_lfm, Xavier for the projection case.",
"350M: d_lfm=1024, d_encoder=1024, no adapter (identity) | "
"1.2B: d_lfm=2048, project from d_encoder=512-1024 → 2048 | "
"Adds ~2M params at 1.2B scale"
)}
{_phase_card(
"4",
"Backbone + LoRA",
"Load the pretrained LFM2.5 base from Hugging Face. The backbone’s "
"parameters are excluded from the optimizer’s parameter set during "
"training; gradients flow through the backbone to update the upstream "
"encoder and downstream heads, but the backbone’s own weights are "
"never modified. The forward pass executes through all 354M backbone "
"parameters at full capacity, at both training and inference time. "
"Customer-distribution adaptation enters through (i) LoRA’s low-rank "
"delta on the attention projections (q_proj / k_proj / v_proj / out_proj) "
"and (ii) the per-feature encoder, both trained from scratch on customer "
"labels. Encoder outputs are injected via the published "
"inputs_embeds hook in Lfm2Model.forward. "
"Adding LoRA to the conv layers does not improve quality enough to justify "
"the ~50% increase in training cost; attention-only LoRA is the recommended "
"starting configuration.",
"Backbone params excluded from optimizer; backbone forward at full capacity | "
"LoRA r=16, α=32, dropout 0.05 on q_proj / k_proj / v_proj / out_proj | "
"PEFT wraps the leaf modules | "
"~1M LoRA params at 350M, ~2M at 1.2B"
)}
{_phase_card(
"5",
"Task Heads",
"Per-task downstream heads pool backbone hidden states and predict via "
"small MLPs. Fraud (BCE loss) pools the last-transaction stripe: the "
"mean over the final F positions, N-F..N-1 with N = T_tx*F (945..959). "
"Categorical heads "
"(next-merchant, amount-bucket, MCC) use cross-entropy and pool the "
"pre-last transaction stripe (positions 930..944) to avoid "
"leaking the prediction target. New tasks add a head, backbone "
"untouched.",
"Per-head MLP: 128 hidden, dropout 0.1 | "
"Pool: last_tx_mean for sequence tasks | "
"Pool: pre_last_tx_mean for next-tx tasks | "
"~0.5M params per head"
)}
{_phase_card(
"6",
"Postprocessing",
"Fraud logits → sigmoid → probability in [0, 1]; calibrate against the "
"customer's operational threshold (typical: 70% precision @ 60% recall "
"for review-queue handoff). Categorical logits → softmax → top-k "
"distribution. Use the predicted distribution for downstream "
"decisioning, not just argmax — the runner-up matters when the top-1 "
"is uncertain. Behavioral attribution: gradient-based saliency on the "
"per-feature embeddings identifies which input features drove the score.",
"Fraud: sigmoid(logits) → operational threshold | "
"Categorical: softmax(logits) → top-k + calibration | "
"Saliency: ∂loss/∂value_embed identifies driving features"
)}
Training Recipe
Single-stage supervised fine-tune on the customer’s labelled data — no
separate pretraining stage. Three trainable parameter groups (LoRA delta,
per-feature encoder, task heads), three learning rates, because each group
differs in initialization, parameter scale, and gradient-norm profile.
LORA GROUP
lr = 1e-3 · ~1M params
Low-rank adapters on the backbone’s attention projections.
Initialized so the LoRA path contributes zero at step 0, then steers
attention behavior toward the customer’s distribution. Higher LR
than the encoder group is fine — the low-rank constraint regularizes
the update by construction.
ENCODER GROUP
lr = 3e-4 · ~14M params
Per-feature value tables + feature-type table, from random init on the
customer’s tokenized vocabulary. Lower LR than LoRA because
random-init embedding matrices destabilize at higher rates;
high-cardinality tables (10K-vocab merchant) dominate gradient norm if
not damped.
HEADS GROUP
lr = 1e-3 · ~2M params
Per-task MLPs (fraud, next-merchant, amount-bucket, MCC), from random
init. Higher LR is fine — small per-head parameter count, well-conditioned
loss surface. New downstream tasks attach as additional heads without
retraining the backbone or the encoder.
Optimizer
AdamW, β = (0.9, 0.95), weight decay 0.1
Schedule
200-step linear warmup, cosine decay to 10% of peak over ~5K steps
Precision
bf16 forward and backward, fp32 loss accumulation
Multi-task
fraud 1.0, categorical heads 0.5 each — chosen to match per-task
gradient norm in the first 200 warmup steps
Compute
~2 hours end-to-end on a single A100 at LFM2.5-350M scale
{_phase_card(
"Deploy",
"Per-Customer Adapter on a Shared Backbone",
"The deployable per-customer artifact is the trained LoRA delta + "
"per-feature encoder + task heads. The LFM2.5 base is not included; "
"it is loaded once per serving GPU from the public weights. At "
"LFM2.5-350M the artifact is ~190 MB in bf16; the unstripped version "
"including the base would be ~900 MB. Multi-tenant serving keeps "
"one backbone resident on the GPU and switches the active LoRA "
"delta + encoder per request, so a new customer adds a small "
"adapter rather than a second foundation model. The conv-dominant "
"backbone quantizes cleanly to INT8 for cost-sensitive deployments. "
"Same code path runs CPU or GPU.",
"Per-customer artifact: ~190 MB bf16 at LFM2.5-350M (LoRA + encoder + heads) | "
"Multi-tenant: shared backbone forward + per-request LoRA switching | "
"CPU (~5s/inference) or H100 (<100ms), same code | "
"INT8 quantization clean for conv layers"
)}
Configuration Choices That Look Right and Aren’t
{_gotcha("1", "Do not compress each transaction to one token",
"an MLP that averages the 15 feature embeddings into a single "
"vector destroys the intra-tx fraud signal. Fraud ROC-AUC "
"collapsed to 0.535. Keep the full T_tx*F stripe.")}
{_gotcha("2", "Pool the pre-last transaction, not the last, for next-tx heads",
"pooling the last-tx stripe for next-merchant prediction "
"leaks the prediction target into the input. Use the prior tx.")}
{_gotcha("3", "Don't unfreeze the backbone at typical label budgets",
"full-backbone unfreezing produces lower quality than frozen-plus-LoRA "
"(fraud ROC-AUC 0.951 → 0.900 on this demo's data). LoRA acts as "
"effective regularization; removing that constraint lets the backbone overfit.")}
{_gotcha("4", "Tied embedding heads need SSL-pretrained value tables",
"tying the next-merchant head to the encoder's merchant value table "
"without self-supervised pretraining of those tables reduces "
"next-merchant top-1 from 7.78% to 3.74%. Use a fresh MLP head "
"until SSL pretraining anchors the value tables.")}
{_gotcha("5", "Average per-feature losses; do not sum",
"summing CE losses across features makes high-cardinality "
"features (10K-vocab merchant) dominate the gradient. The "
"low-cardinality features stop training.")}
{_gotcha("6", "Match the schema fingerprint between training and inference",
"if the tokenizer's vocab changes between training and "
"deployment, the encoder's value tables index into a different "
"semantic space. Embed the fingerprint in checkpoint metadata.")}
Typical Engagement
| Phase | Duration | What Happens |
|---|---|---|
| Discovery | 1-2 weeks | Schema design, data sample (~100K-1M sequences), compliance review, architectural fit assessment. |
| POC | 1 week | Fine-tune encoder + LoRA + heads on customer sample, measurement report, go/no-go recommendation. |
| Production | 2-3 months | Customer engineering team builds with Liquid architectural support, weekly design review, scale-up to LFM2.5-1.2B. |
| Scale | Ongoing | Multi-task expansion (add heads), multi-tenant serving, retraining cadence, architecture evolution. |
"""