---
language:
- en
- it
- fr
- es
- de
- pt
tags:
- temporal-normalization
- byt5
- onnx
- medical
---

# Semplifica T5 Temporal Normalizer

## Model Description

**Semplifica T5 Temporal Normalizer** is a fine-tuned version of Google's [ByT5-Small](https://huggingface.co/google/byt5-small) specifically designed to solve a complex NLP problem: **normalizing noisy, slang, relative, and incomplete temporal expressions** into standard ISO formats (`YYYY-MM-DD` or `HH:MM`).

Because it operates at the character level (UTF-8 bytes), ByT5 has no Out-Of-Vocabulary (OOV) tokens and is highly robust to typos and dirty OCR output, making it reliable on real-world, messy documents.

The model expects an **Anchor Date** (reference date), an optional **Language Code**, and the **Temporal String** as input:
> Input format: `YYYY-MM-DD | lang (optional) | input_text`

## Use Cases

1. **Clinical & Medical (EHR) — Primary:** Extract precise timelines from Electronic Health Records where doctors use extreme abbreviations ("3 days post-op", "admission + 2").
2. **Legal & Compliance:** Analyze legal contracts with relative deadlines ("within 30 days from signature").
3. **Conversational AI & Booking:** Chatbots processing user requests like "book a flight for next Tuesday afternoon".
4. **Logistics & Supply Chain:** Parsing informal shipping emails ("expected delivery in 2 days").

## Hardware Portability & ONNX

A core goal of this model is **universal portability**. It has been exported to **ONNX** in three precision formats:

| Format | Size | Notes |
|--------|------|-------|
| FP32 | ~1.14 GB | Full precision (Encoder + Decoder separated), validation reference |
| FP16 | ~738 MB | Half precision, ideal for GPU/NPU with Tensor Cores |
| INT8 | ~290 MB | Symmetric per-tensor weight quantization (~75% reduction vs FP32), ideal for CPU / Edge / Rust |
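
The INT8 build applies symmetric per-tensor weight quantization to the graph initializers. A minimal NumPy sketch of that scheme on a single weight tensor (illustrative only, not the actual export script):

```python
import numpy as np

def quantize_symmetric_per_tensor(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric scheme: one scale for the whole tensor, zero-point fixed at 0."""
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original weights."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.27, 0.0, 1.27], dtype=np.float32)
q, scale = quantize_symmetric_per_tensor(w)
w_hat = dequantize(q, scale)
print(q)  # [  50 -127    0  127]
```

Storing `int8` weights plus one `float` scale per tensor is what yields the ~75% size reduction versus FP32.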

## Evaluation Metrics (ONNX Runtime)

Tested on GPU (`CUDAExecutionProvider`) on a 1,000-record evaluation sample:

| Model Format | Size | Exact Match Accuracy | F1 (Macro) | Throughput (samples/s) |
|--------------|------|----------------------|------------|------------------------|
| **FP32** | ~1.14 GB | 99.40% | 99.53% | ~44.0 |
| **FP16** | ~738 MB | 99.40% | 99.53% | ~39.8 |
| **INT8** | ~290 MB | 99.40% | 99.53% | ~31.7 |

---

## Usage in Python (Hugging Face Transformers)

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "SemplificaAI/t5-temporal-normalizer"
# Important: always load the tokenizer from the base model to avoid a known
# ByT5 tokenizer serialization bug in transformers >= 5.x
tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Format: YYYY-MM-DD | lang (optional) | text
input_text = "2024-01-01 | en | 3 days post admission"
inputs = tokenizer(input_text, return_tensors="pt")

outputs = model.generate(**inputs, max_length=16)
# Use skip_special_tokens=False + manual cleanup to avoid a deadlock bug
# in transformers >= 5.x with skip_special_tokens=True
result = tokenizer.decode(outputs[0], skip_special_tokens=False)
result = result.replace("<pad>", "").replace("</s>", "").strip()

print(result)
# Output: 2024-01-04
```

## Usage in Python (ONNX Runtime)

```python
import onnxruntime as ort
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
opts = ort.SessionOptions()
enc_sess = ort.InferenceSession("byt5_encoder_int8.onnx", sess_options=opts, providers=["CPUExecutionProvider"])
dec_sess = ort.InferenceSession("byt5_decoder_int8.onnx", sess_options=opts, providers=["CPUExecutionProvider"])

input_text = "2024-01-01 | en | 3 days post admission"
enc = tokenizer(input_text, return_tensors="np", max_length=64, padding="max_length", truncation=True)

# 1. Encoder forward pass
enc_hs = enc_sess.run(None, {
    "input_ids": enc["input_ids"],
    "attention_mask": enc["attention_mask"],
})[0]

# 2. Autoregressive greedy decode loop
MAX_OUT_LEN = 16
PAD_ID = 0
EOS_ID = 1

cur_ids = np.zeros((1, MAX_OUT_LEN), dtype=np.int64)
cur_mask = np.zeros((1, MAX_OUT_LEN), dtype=np.int64)
cur_ids[0, 0] = PAD_ID
cur_mask[0, 0] = 1

generated = []

for step in range(MAX_OUT_LEN - 1):
    logits = dec_sess.run(None, {
        "decoder_input_ids": cur_ids,
        "decoder_attention_mask": cur_mask,
        "encoder_hidden_states": enc_hs,
        "encoder_attention_mask": enc["attention_mask"],
    })[0]
    
    next_tok = int(np.argmax(logits[0, step]))
    if next_tok == EOS_ID:
        break
    generated.append(next_tok)
    
    cur_ids[0, step + 1] = next_tok
    cur_mask[0, step + 1] = 1

output_text = bytes([t - 3 for t in generated if t >= 3]).decode("utf-8", errors="ignore")
print("Prediction:", output_text)
```

## Usage in Go (ONNX Runtime)

A highly optimized Go evaluation pipeline is available in the `go_eval` directory. It demonstrates running the Encoder and Decoder as separate sessions with pre-allocated tensors and fixed-length sequence padding (`MAX_OUT_LEN = 16`), with optional fallback to `CUDAExecutionProvider`.

```go
package main

import (
	ort "github.com/yalue/onnxruntime_go"
)

func main() {
	ort.SetSharedLibraryPath("libonnxruntime.so")
	ort.InitializeEnvironment()
	defer ort.DestroyEnvironment()

	// Load the separated ONNX models
	encSess, _ := ort.NewAdvancedSession("byt5_encoder_fp32.onnx", /* ... */)
	decSess, _ := ort.NewAdvancedSession("byt5_decoder_fp32.onnx", /* ... */)

	// 1. Encoder pass
	_ = encSess.Run()

	// 2. Decoder autoregressive loop with fixed mask (MAX_OUT_LEN = 16)
	for step := 0; step < 15; step++ {
		_ = decSess.Run()
		// Read this step's logits, take the argmax, and write the token
		// into the pre-allocated decoder input buffer
	}
}
```

## Usage in Rust (ONNX Runtime)

For production environments, use the [`ort`](https://github.com/pykeio/ort) crate. Since T5 is an encoder-decoder architecture, generation requires an autoregressive loop.

```toml
# Cargo.toml
[dependencies]
ort = "2.0"
```

```rust
use ort::{GraphOptimizationLevel, Session};

fn main() -> ort::Result<()> {
    let session = Session::builder()?
        .with_optimization_level(GraphOptimizationLevel::Level3)?
        .with_intra_threads(4)?
        .commit_from_file("byt5_encoder_fp32.onnx")?;

    // ByT5 tokenization: each UTF-8 byte maps to token_id = byte + 3
    // (0=pad, 1=eos, 2=unk, then 3..258 = bytes 0..255)
    // Load both encoder and decoder sessions, then run autoregressive loop with fixed size padding

    Ok(())
}
```

## Technical Notes

- **ByT5 Tokenizer:** Each UTF-8 byte maps to `token_id = byte_value + 3`. Tokens 0/1/2 are PAD/EOS/UNK. Always load the tokenizer from `google/byt5-small` — the fine-tuned checkpoint may have a corrupted tokenizer config due to a known serialization bug in `transformers >= 5.x`.
- **ONNX Export:** Exported with `torch.onnx.export(dynamo=True)` + `onnxscript`. The old JIT tracer (`dynamo=False`) is incompatible with the new masking utilities in `transformers >= 5.x`.
- **INT8 Quantization:** Symmetric per-tensor quantization applied directly to the ONNX graph initializers (numpy-based). PyTorch `quantize_dynamic` models are not exportable via the dynamo exporter (`LinearPackedParamsBase` is not serializable by `torch.export`).
- **ONNX Architecture:** To overcome issues with ByT5 relative positional embeddings dynamically broadcasting at runtime, the model is exported as a **separated Encoder and Decoder**. The Decoder expects a fixed-length sequence of 16, which is updated sequentially using a padding mask during the autoregressive loop (see Python and Rust examples above).
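
The byte mapping in the first note needs no library at all; a minimal encode/decode pair (illustrative, not the Transformers tokenizer implementation):

```python
PAD_ID, EOS_ID, UNK_ID = 0, 1, 2
OFFSET = 3  # token_id = byte_value + 3

def byt5_encode(text: str) -> list[int]:
    """Map each UTF-8 byte to its ByT5 token id and append EOS."""
    return [b + OFFSET for b in text.encode("utf-8")] + [EOS_ID]

def byt5_decode(token_ids: list[int]) -> str:
    """Drop special tokens (ids < 3) and map the rest back to raw bytes."""
    return bytes(t - OFFSET for t in token_ids if t >= OFFSET).decode("utf-8", errors="ignore")

print(byt5_decode(byt5_encode("2024-01-04")))  # 2024-01-04
```

This is the same mapping used in the manual decode step of the ONNX Runtime examples above.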

## Author & Contact

- **Author:** Dario Finardi
- **Company:** [Semplifica](https://semplifica.ai)
- **Email:** hf@semplifica.ai