---
language:
- en
- it
- fr
- es
- de
- pt
tags:
- temporal-normalization
- byt5
- onnx
- medical
---

# Semplifica T5 Temporal Normalizer

## Model Description

**Semplifica T5 Temporal Normalizer** is a fine-tuned version of Google's [ByT5-Small](https://huggingface.co/google/byt5-small), designed to **normalize noisy, slang, relative, and incomplete temporal expressions** into standard ISO formats (`YYYY-MM-DD` or `HH:MM`).

Because ByT5 operates directly on UTF-8 bytes, it has no fixed vocabulary and therefore no Out-Of-Vocabulary (OOV) tokens. This makes it robust to typos and dirty OCR output in real-world, messy documents.

The model expects an **Anchor Date** (reference date), an optional **Language Code**, and the **Temporal String** as input:

> Input format: `YYYY-MM-DD | lang (optional) | input_text`

## Use Cases

1. **Clinical & Medical (EHR, primary):** Extract precise timelines from Electronic Health Records where doctors use extreme abbreviations ("3 days post-op", "admission + 2").
2. **Legal & Compliance:** Analyze legal contracts with relative deadlines ("within 30 days from signature").
3. **Conversational AI & Booking:** Chatbots processing user requests like "book a flight for next Tuesday afternoon".
4. **Logistics & Supply Chain:** Parsing informal shipping emails ("expected delivery in 2 days").

## Hardware Portability & ONNX

A core goal of this model is **universal portability**.
It has been exported to **ONNX** in three precision formats:

| Format | Size | Notes |
|--------|------|-------|
| FP32 | ~1.14 GB | Full precision (Encoder + Decoder separated), validation reference |
| FP16 | ~738 MB | Half precision, ideal for GPU/NPU with Tensor Cores |
| INT8 | ~290 MB | Symmetric per-tensor weight quantization (~75% reduction vs FP32), ideal for CPU / Edge / Rust |

## Evaluation Metrics (ONNX Runtime)

Tested on GPU (CUDAExecutionProvider) using a 1,000-record evaluation sample:

| Model Format | Size | Exact Match Accuracy | F1 (Macro) | Throughput (samples/s) |
|--------------|------|----------------------|------------|------------------------|
| **FP32** | ~1.14 GB | 99.40% | 99.53% | ~44.0 |
| **FP16** | ~738 MB | 99.40% | 99.53% | ~39.8 |
| **INT8** | ~290 MB | 99.40% | 99.53% | ~31.7 |

---

## Usage in Python (HuggingFace Transformers)

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "SemplificaAI/t5-temporal-normalizer"

# Important: always load the tokenizer from the base model to avoid a known
# ByT5 tokenizer serialization bug in transformers >= 5.x
tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Format: YYYY-MM-DD | lang (optional) | text
input_text = "2024-01-01 | en | 3 days post admission"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=16)

# Use skip_special_tokens=False + manual cleanup to avoid a deadlock bug
# in transformers >= 5.x with skip_special_tokens=True
result = tokenizer.decode(outputs[0], skip_special_tokens=False)
result = result.replace("<pad>", "").replace("</s>", "").strip()
print(result)  # Output: 2024-01-04
```

## Usage in Python (ONNX Runtime)

```python
import onnxruntime as ort
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")

opts = ort.SessionOptions()
enc_sess = ort.InferenceSession("byt5_encoder_int8.onnx", sess_options=opts, providers=["CPUExecutionProvider"])
dec_sess = ort.InferenceSession("byt5_decoder_int8.onnx", sess_options=opts, providers=["CPUExecutionProvider"])

input_text = "2024-01-01 | en | 3 days post admission"
enc = tokenizer(input_text, return_tensors="np", max_length=64, padding="max_length", truncation=True)

# 1. Encoder forward pass
enc_hs = enc_sess.run(None, {
    "input_ids": enc["input_ids"],
    "attention_mask": enc["attention_mask"],
})[0]

# 2. Autoregressive greedy decode loop
MAX_OUT_LEN = 16
PAD_ID = 0
EOS_ID = 1

# Fixed-length decoder buffers; the mask marks which positions are filled so far
cur_ids = np.zeros((1, MAX_OUT_LEN), dtype=np.int64)
cur_mask = np.zeros((1, MAX_OUT_LEN), dtype=np.int64)
cur_ids[0, 0] = PAD_ID  # T5 uses the pad token as the decoder start token
cur_mask[0, 0] = 1

generated = []
for step in range(MAX_OUT_LEN - 1):
    logits = dec_sess.run(None, {
        "decoder_input_ids": cur_ids,
        "decoder_attention_mask": cur_mask,
        "encoder_hidden_states": enc_hs,
        "encoder_attention_mask": enc["attention_mask"],
    })[0]
    next_tok = int(np.argmax(logits[0, step]))
    if next_tok == EOS_ID:
        break
    generated.append(next_tok)
    cur_ids[0, step + 1] = next_tok
    cur_mask[0, step + 1] = 1

# Map token ids back to bytes (token_id = byte + 3); skip special and sentinel ids
output_text = bytes([t - 3 for t in generated if 3 <= t < 259]).decode("utf-8", errors="ignore")
print("Prediction:", output_text)
```

## Usage in Go (ONNX Runtime)

A highly optimized Go evaluation pipeline is available in the `go_eval` directory, demonstrating the separation of Encoder and Decoder execution with pre-allocated tensors and fixed sequence padding (`MAX_OUT_LEN = 16`). It supports fallback to `CUDAExecutionProvider`.

```go
package main

import (
	ort "github.com/yalue/onnxruntime_go"
)

func main() {
	ort.SetSharedLibraryPath("libonnxruntime.so")
	ort.InitializeEnvironment()
	defer ort.DestroyEnvironment()

	// Load separated ONNX models (session arguments elided; errors ignored for brevity)
	encSess, _ := ort.NewAdvancedSession("byt5_encoder_fp32.onnx", /* ... */)
	decSess, _ := ort.NewAdvancedSession("byt5_decoder_fp32.onnx", /* ... */)

	// 1. Encoder pass
	_ = encSess.Run()

	// 2. Decoder autoregressive loop with fixed mask
	for step := 0; step < 15; step++ {
		_ = decSess.Run()
		// Get step logits, argmax, and update input buffer
	}
}
```

## Usage in Rust (ONNX Runtime)

For production environments, use the [`ort`](https://github.com/pykeio/ort) crate. Since T5 is an encoder-decoder architecture, generation requires an autoregressive loop.

```toml
# Cargo.toml
[dependencies]
ort = "2.0"
```

```rust
use ort::{GraphOptimizationLevel, Session};

fn main() -> ort::Result<()> {
    let session = Session::builder()?
        .with_optimization_level(GraphOptimizationLevel::Level3)?
        .with_intra_threads(4)?
        .commit_from_file("byt5_encoder_fp32.onnx")?;

    // ByT5 tokenization: each UTF-8 byte maps to token_id = byte + 3
    // (0=pad, 1=eos, 2=unk, then 3..258 = bytes 0..255)

    // Load both encoder and decoder sessions, then run autoregressive loop with fixed size padding
    Ok(())
}
```

## Technical Notes

- **ByT5 Tokenizer:** Each UTF-8 byte maps to `token_id = byte_value + 3`. Tokens 0/1/2 are PAD/EOS/UNK. Always load the tokenizer from `google/byt5-small`: the fine-tuned checkpoint may have a corrupted tokenizer config due to a known serialization bug in `transformers >= 5.x`.
- **ONNX Export:** Exported with `torch.onnx.export(dynamo=True)` + `onnxscript`. The old JIT tracer (`dynamo=False`) is incompatible with the new masking utilities in `transformers >= 5.x`.
- **INT8 Quantization:** Symmetric per-tensor quantization applied directly to the ONNX graph initializers (numpy-based). PyTorch `quantize_dynamic` models are not exportable via the dynamo exporter (`LinearPackedParamsBase` is not serializable by `torch.export`).
- **ONNX Architecture:** To overcome issues with ByT5 relative positional embeddings dynamically broadcasting at runtime, the model is exported as a **separated Encoder and Decoder**. The Decoder expects a fixed-length sequence of 16, which is filled in step by step under a padding mask during the autoregressive loop (see the Python and Go examples above).
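
The byte-to-token mapping from the ByT5 Tokenizer note can be reproduced without `transformers`, which is useful when porting to runtimes that have no tokenizer library. The sketch below is a minimal illustration, not the library's implementation: the helper names `encode_byt5` and `decode_byt5` are hypothetical, and it assumes the layout stated above (0=pad, 1=eos, 2=unk, bytes offset by 3) with an EOS token appended to the input.

```python
PAD_ID, EOS_ID, UNK_ID = 0, 1, 2
BYTE_OFFSET = 3  # token_id = byte_value + 3


def encode_byt5(text: str, add_eos: bool = True) -> list[int]:
    """Map each UTF-8 byte to a token id (hypothetical helper)."""
    ids = [b + BYTE_OFFSET for b in text.encode("utf-8")]
    if add_eos:
        ids.append(EOS_ID)
    return ids


def decode_byt5(token_ids: list[int]) -> str:
    """Drop special/sentinel ids and turn the remaining ids back into text."""
    raw = bytes(
        t - BYTE_OFFSET
        for t in token_ids
        if BYTE_OFFSET <= t < BYTE_OFFSET + 256
    )
    return raw.decode("utf-8", errors="ignore")


ids = encode_byt5("2024-01-01 | en | 3 days post admission")
print(ids[:10])          # ids for the bytes of the anchor date
print(decode_byt5(ids))  # round-trips to the original string
```

Multi-byte characters simply expand to several ids, which is why accented or non-Latin input never falls outside the vocabulary.
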
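
The INT8 note describes symmetric per-tensor quantization of the graph initializers. As an illustration of that arithmetic only (this is not the export script used for this model), the NumPy sketch below quantizes one weight tensor with a single scale and dequantizes it again. The function names and the `[-127, 127]` integer range are assumptions.

```python
import numpy as np


def quantize_symmetric_per_tensor(w: np.ndarray):
    """Quantize a float tensor to int8 with one scale and zero-point 0."""
    max_abs = float(np.max(np.abs(w)))
    # One scale for the whole tensor; guard against an all-zero tensor
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.rint(w / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float tensor from int8 values."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, scale = quantize_symmetric_per_tensor(w)
w_hat = dequantize(q, scale)

# Rounding error per element is bounded by about half a quantization step
print(q.dtype, float(np.max(np.abs(w - w_hat))), scale / 2)
```

Storing one byte per weight instead of four is where the roughly 75% size reduction in the format table comes from.
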
## Author & Contact

- **Author:** Dario Finardi
- **Company:** [Semplifica](https://semplifica.ai)
- **Email:** hf@semplifica.ai