Auden 12B

Auden turns receipts and invoices into structured tool calls. It is a LoRA fine-tune of Gemma 4 12B (instruction-tuned) that emits OpenAI-style tool_callsextract_fields with the requested values, or ask_follow_up when the document genuinely doesn't contain what you asked for — instead of free text you have to parse. It ships as GGUF for llama.cpp and runs fully offline on a single consumer GPU or Apple Silicon Mac. Built by NuroAI Labs.

On real receipts (CORD test, OCR text in), Auden v1 reaches 98.95% field-level extraction accuracy and 99% end-to-end success, versus 11.6% / 14% for stock Gemma 4 12B under the identical harness — with hallucinated fields cut from 245 (v0) to 17 and 100% abstention accuracy when a requested field is absent.

Table of contents

Model details

Base model google/gemma-4-12B-it (12B, instruction-tuned, Apache 2.0)
Fine-tuning LoRA SFT via Unsloth — r=16, alpha=32, attention + MLP projections (text and vision towers)
Task Document field extraction as native Gemma 4 function calls
Input OCR text of a receipt/invoice (see Limitations re: image input)
Output Structured tool_calls (finish_reason: tool_calls)
Formats GGUF Q4_K_M (7.4 GB) and Q8_0 (12.7 GB), each with a BF16 vision projector (mmproj)
Runtime llama.cpp llama-server with --jinja (chat template + tool grammar embedded in the GGUF)
Languages Trained/evaluated on English, Indonesian, Malay, Turkish receipt locales
Author NuroAI Labs

Tool interface

Auden is trained against a 15-tool document-agent schema rendered through Gemma 4's native <|tool> declarations. The evaluated core:

  • extract_fields(schema, doc_ref?) — return requested fields from the active document
  • ask_follow_up(question, missing_fields?) — abstain and ask when a requested field is not in the document
  • no_action_needed(reason, confidence?) — decline when no tool action is warranted

The wider schema also covers query_document, classify_doc, extract_table, summarize, redact, compare_docs, convert, fill_form, search_folder, create_event_from_doc, set_reminder, and draft_email. Published benchmarks exercise extract_fields and ask_follow_up.

Intended uses & limitations

Intended uses

  • Local/offline receipt and invoice field extraction (totals, dates, merchants, currencies) feeding bookkeeping, expense, or archival pipelines.
  • Agent stacks that need machine-parseable extraction with an explicit abstention path instead of best-guess free text.

Limitations

  • Feed OCR text, not images. Image-in extraction through the quantized llama.cpp multimodal path scores poorly for this model family (the unquantized bf16 base reads the same receipts correctly, so this is a quantized-runtime issue, not a model-capability ceiling). Run your own OCR step and pass the text.
  • Evaluation covers Indonesian-market CORD receipts and synthetic receipts/invoices in four locales. Long multi-page invoices, handwriting, and very different document styles are uncharacterized.
  • Published numbers assume the Auden system prompt and tool schemas (below). The model remains a general Gemma under other prompts, but the benchmarks don't transfer.
  • Extraction confidence is not calibrated; the abstention behavior covers missing fields, not uncertain reads. Keep a human in the loop where errors are costly.
  • Inherits the biases and failure modes of Gemma 4 12B.

How to use

Serve (llama.cpp)

llama-server -m auden-12b-Q4_K_M.gguf --mmproj mmproj-BF16.gguf \
  --jinja -ngl 999 --host 127.0.0.1 --port 8089

--jinja is required — it activates the embedded chat template, which renders tool declarations into Gemma 4's native function-calling format.

Call (OpenAI-compatible)

import requests

TOOLS = [{
    "type": "function",
    "function": {
        "name": "extract_fields",
        "description": "Extract the requested fields from the active document.",
        "parameters": {
            "type": "object",
            "properties": {
                "fields": {"type": "object", "description": "Field name to extracted value."},
                "schema": {"type": "object", "description": "Requested field schema."},
            },
        },
    },
}, {
    "type": "function",
    "function": {
        "name": "ask_follow_up",
        "description": "Ask the user a question when required information is missing.",
        "parameters": {
            "type": "object",
            "properties": {
                "question": {"type": "string"},
                "missing_fields": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["question"],
        },
    },
}]

payload = {
    "model": "auden",
    "messages": [
        {"role": "system", "content": "You are Auden, a precise local document agent. "
                                       "Use tools only when the document evidence supports them."},
        {"role": "user", "content": "OCR text from the active document:\n"
                                    "RECEIPT\nCorner Cafe\nDate: 2025-03-02\n"
                                    "Coffee 2 x 3.50 7.00\nTotal: 7.00\n\nExtract total."},
    ],
    "tools": TOOLS,
    "temperature": 0,
}
r = requests.post("http://127.0.0.1:8089/v1/chat/completions", json=payload).json()
print(r["choices"][0]["message"]["tool_calls"])
# -> [{"function": {"name": "extract_fields", "arguments": "{\"fields\":{\"total\":\"7.00\"}, ...}"}}]

If the document lacks the requested field, the model calls ask_follow_up with missing_fields instead of inventing a value.

Evaluation

All three models below were run through the identical harness: same prompts, same tool schemas, same normalized scorer (auden-docbench-v1-normalized), same llama.cpp build (commit 76da2450), all at Q4_K_M, OCR-text input, greedy decoding. Full commands in repro_commands.md.

  • CORD 100 — 100 real receipts from the CORD-v2 test split (held out from training; overlap-checked).
  • Synthetic 500 — 500 held-out synthetic receipts/invoices across en/id/ms/tr locales.
  • Abstention 120 — 120 documents where the requested field is absent; correct behavior is ask_follow_up.
Eval Metric Stock Gemma 4 12B Auden v0 Auden v1 (this release)
CORD 100 (real receipts) Tool-call accuracy 0.5900 1.0000 1.0000
CORD 100 (real receipts) Field/argument match 0.1158 0.8526 0.9895
CORD 100 (real receipts) End-to-end success 0.1400 0.8600 0.9900
Synthetic 500 Tool-call accuracy 0.9140 1.0000 1.0000
Synthetic 500 Field/argument match 0.0900 0.9588 0.9638
Synthetic 500 End-to-end success 0.0080 0.8700 0.8940
Abstention 120 Abstention accuracy 0.9833 0.2833 1.0000

Field-level failure events on CORD 100 (same analyzer):

Category Stock Gemma Auden v0 Auden v1
Hallucinated field 85 245 17
Missing field 84 0 1
Wrong value 0 14 0
Formatting mismatch 1 8 2

Reading the baseline honestly. Stock Gemma fails to emit any valid extract_fields call on 41/100 real receipts; when it does call, arguments are usually schema-echoes or partial. Its low hallucination count reflects failing to extract at all (84 missing-field events), not precision. Its abstention score is genuinely high (0.9833) because the eval prompt explicitly instructs ask-if-missing behavior and the instruction-tuned base follows instructions — the fine-tune's value is concentrated where it matters: extraction accuracy and hallucination control. v0's abstention regression (0.2833) was fixed in v1 (1.0000).

Training

  • Data: 13,558 tool-call traces — synthetic receipts/invoices (procedural generator with missing-total, occlusion, scan-noise, multi-currency, and wrong-document negatives; en/id/ms/tr locales), ~8% abstention traces, ~11% image-grounded traces, ~11% real-receipt traces. Zero overlap with all three eval sets (checked by document ID).
  • Procedure: LoRA SFT with Unsloth on Gemma 4 12B-it — r=16, alpha=32, dropout 0, targets q/k/v/o + gate/up/down projections in both text and vision towers. 2,500 steps, lr 2e-4, effective batch 4 (1 × 4 grad-accum), seed 3407, ~0.74 epochs, 91 minutes on 1× H100. Final train loss 0.393.
  • Validation gates: every trace passed an argument-match validator (pass rate 1.0); training format locked to the official processor's apply_chat_template/parse_response round-trip.
  • Export: LoRA merged into the bf16 base, converted to GGUF via llama.cpp (convert_hf_to_gguf), quantized to Q4_K_M and Q8_0.

Files & checksums

File Quant Size SHA256
q4_k_m/auden-12b-Q4_K_M.gguf Q4_K_M 7.4 GB 8522b4f696abc5721342f9bad6caeab73659aaf822d4a69931affc222c6fca6f
q4_k_m/mmproj-BF16.gguf BF16 mmproj 175 MB df7d0454dfb7a31d02a3845478f655b991c16f5d9b76cd308a96bc2d07c17f7e
q8_0/auden-12b-Q8_0.gguf Q8_0 12.7 GB 156295ec4ebe3eee54a757b194a2bf565d88340c5e9f2b5a5f88b5614fac920c
q8_0/mmproj-BF16.gguf BF16 mmproj 175 MB 8e198e532d9c7720232768947373e519d9bd59b7e011364340a90ed3f7049110

Q4_K_M is the evaluated release artifact (all published numbers). Q8_0 is provided for users who want headroom; it was not separately benchmarked. Machine-readable checksums: checksums.txt.

License & attribution

  • License: Apache 2.0. Auden is a derivative of Gemma 4 12B by Google, released under the Apache License 2.0. See NOTICE. "Gemma" identifies the base model; this project is not affiliated with or endorsed by Google.
  • CORD evaluation data: naver-clova-ix/cord-v2, CC-BY-4.0 — Park et al., "CORD: A Consolidated Receipt Dataset for Post-OCR Parsing" (Clova AI, NAVER). No CORD data is redistributed in this repository.
  • Tooling: llama.cpp (MIT), Unsloth (Apache 2.0).

Citation

@misc{auden12b2026,
  title     = {Auden 12B: Receipt and Invoice Extraction as Structured Tool Calls},
  author    = {{NuroAI Labs}},
  year      = {2026},
  url       = {https://huggingface.co/nuroai/auden-12b},
  publisher = {NuroAI Labs, https://nuroailabs.com}
}
Downloads last month
201
GGUF
Model size
12B params
Architecture
gemma4
Hardware compatibility
Log In to add your hardware

4-bit

8-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for nuroai/auden-12b

Finetuned
(100)
this model

Dataset used to train nuroai/auden-12b