A quantized version of dslim/distilbert-NER optimized for efficient named entity recognition. This model uses Quantization-Aware Training (QAT) with INT8 dynamic activation and INT4 weight quantization, exported to ONNX format for production deployment.
This model is a quantized version of the DistilBERT NER model, fine-tuned on the CoNLL-2003 dataset for named entity recognition. The quantization preserves accuracy while significantly reducing model size and inference latency.
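To illustrate the idea behind INT4 weight quantization, here is a minimal, self-contained sketch of symmetric per-tensor quantization (not the actual QAT pipeline used to produce this model, whose observer-based scheme is more involved; function names are illustrative):

```python
import numpy as np

def quantize_int4_symmetric(weights: np.ndarray):
    """Symmetric per-tensor INT4 quantization: map floats to integers in [-8, 7]."""
    scale = np.abs(weights).max() / 7.0  # largest magnitude maps to the INT4 limit
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.array([0.31, -0.07, 0.92, -0.55], dtype=np.float32)
q, scale = quantize_int4_symmetric(w)
w_hat = dequantize(q, scale)  # approximate reconstruction of the original weights
```

QAT additionally simulates this round-trip during fine-tuning so the network learns weights that survive the rounding error, which is why accuracy is largely preserved at INT4.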
```python
import json

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer


class NEROnnxRunner:
    def __init__(self, model_dir: str):
        # config.json maps label ids to BIO tag names (e.g. 0 -> "O", 1 -> "B-PER")
        with open(f"{model_dir}/config.json", "r") as f:
            self.id2label = {int(k): v for k, v in json.load(f).items()}
        self.tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
        self.session = ort.InferenceSession(
            f"{model_dir}/model.onnx", providers=["CPUExecutionProvider"]
        )

    def predict(self, text: str) -> list[dict]:
        encoding = self.tokenizer(
            text,
            return_tensors="np",
            padding="max_length",
            truncation=True,
            max_length=128,
            return_offsets_mapping=True,
        )
        ort_inputs = {
            "input_ids": encoding["input_ids"].astype(np.int64),
            "attention_mask": encoding["attention_mask"].astype(np.int64),
        }
        logits = self.session.run(None, ort_inputs)[0]
        predictions = np.argmax(logits, axis=2)[0]

        # Reuse the offset mapping from the encoding above; special and
        # padding tokens carry (0, 0) offsets and are skipped below.
        offsets = encoding["offset_mapping"][0]
        entities = []
        for idx, (start, end) in enumerate(offsets):
            if start == end:
                continue
            label = self.id2label.get(int(predictions[idx]), "O")
            if label != "O":
                entities.append({
                    "word": text[start:end],
                    "label": label,
                    "start": int(start),
                    "end": int(end),
                })
        return entities


runner = NEROnnxRunner("./distilbert-ner-qat-int4")
entities = runner.predict("Apple Inc. is based in Cupertino, California.")
# [{'word': 'Apple Inc.', 'label': 'B-ORG'}, {'word': 'Cupertino', 'label': 'B-LOC'}, {'word': 'California', 'label': 'B-LOC'}]
```
| Label | Description |
|---|---|
| PER | Person names |
| ORG | Organizations, companies, institutions |
| LOC | Locations, cities, countries |
| MISC | Miscellaneous entities (events, products, etc.) |
The labels follow the BIO tagging scheme:

- B-X: Beginning of entity type X
- I-X: Inside entity type X
- O: Outside any entity

The model was calibrated using 300 samples from the AG News dataset for QAT observer statistics.
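Because the model emits token-level BIO tags, downstream code typically merges consecutive B-/I- tokens into full entity spans. A minimal sketch, assuming the per-token dicts (`label`, `start`, `end`) produced by the `predict` method above (the `merge_bio_spans` helper is illustrative, not part of this repository):

```python
def merge_bio_spans(tokens: list[dict], text: str) -> list[dict]:
    """Merge BIO-tagged token dicts into contiguous entity spans."""
    spans = []
    current = None
    for tok in tokens:
        prefix, _, etype = tok["label"].partition("-")  # "B-ORG" -> ("B", "ORG")
        if prefix == "B" or current is None or current["type"] != etype:
            # A B- tag or a type change starts a new entity span.
            if current is not None:
                spans.append(current)
            current = {"type": etype, "start": tok["start"], "end": tok["end"]}
        else:
            # An I- tag of the same type extends the current span.
            current["end"] = tok["end"]
    if current is not None:
        spans.append(current)
    # Recover each surface form from the original text so that
    # whitespace between subword tokens is preserved.
    for s in spans:
        s["text"] = text[s["start"]:s["end"]]
    return spans

tokens = [
    {"word": "Apple", "label": "B-ORG", "start": 0, "end": 5},
    {"word": "Inc", "label": "I-ORG", "start": 6, "end": 9},
    {"word": ".", "label": "I-ORG", "start": 9, "end": 10},
    {"word": "Cupertino", "label": "B-LOC", "start": 23, "end": 32},
]
spans = merge_bio_spans(tokens, "Apple Inc. is based in Cupertino")
# -> [{'type': 'ORG', 'start': 0, 'end': 10, 'text': 'Apple Inc.'},
#     {'type': 'LOC', 'start': 23, 'end': 32, 'text': 'Cupertino'}]
```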
| Parameter | Value |
|---|---|
| Base Model | dslim/distilbert-NER |
| Weight Quantization | INT4 |
| Activation Quantization | INT8 (dynamic) |
| Calibration Samples | 300 |
| Max Sequence Length | 128 |
| Export Format | ONNX (opset 18) |
| Metric | Score |
|---|---|
| Row Recall | 90.00% |
| Full Recall Rows | 27/30 |
Row Recall measures the percentage of evaluation samples (rows) in which every expected entity was successfully extracted.
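The metric can be computed as follows; the entity sets below are illustrative, not the actual evaluation data:

```python
def row_recall(expected: list[set], predicted: list[set]) -> float:
    """Fraction of rows where every expected entity appears in the predictions."""
    full = sum(1 for exp, pred in zip(expected, predicted) if exp <= pred)
    return full / len(expected)

expected = [{"Sam Altman", "OpenAI"}, {"Frankfurt"}, {"COP26", "Glasgow"}]
predicted = [{"Sam Altman", "OpenAI"}, set(), {"COP26", "Glasgow", "Obama"}]
row_recall(expected, predicted)  # 2 of 3 rows fully recovered -> 0.666...
```

Note that extra (spurious) predictions do not hurt this metric; only missed entities do.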
| Text | Entities |
|---|---|
| Sam Altman returned to OpenAI as CEO | Sam Altman (PER), OpenAI (ORG) |
| The European Central Bank announced... | European Central Bank (ORG), Frankfurt (LOC) |
| Barack Obama gave a speech at COP26 in Glasgow | Barack Obama (PER), COP26 (MISC), Glasgow (LOC) |
| Microsoft agreed to acquire Activision Blizzard | Microsoft (ORG), Activision Blizzard (ORG) |
Apache 2.0 License