---
license: apache-2.0
library_name: transformers
tags:
- json-repair
- schema-validation
- structured-output
- tool-calling
- agent-workflows
- code-t5
- constraint-dsl
base_model: Salesforce/codet5p-220m
pipeline_tag: text-generation
---
# StructFix
Schema-aware structured output recovery for LLMs and agent workflows.
- Recovers invalid structured outputs
- Repairs missing required fields
- Fixes enum violations
- Validates and repairs tool-call payloads
- Handles markdown-wrapped or text-wrapped JSON
- Lightweight: 220M parameters
**91.9% schema success on unseen schemas with randomized field names.**
StructFix is a CodeT5+ 220M model fine-tuned to repair broken structured outputs using **ConstraintDSL**, a compact schema representation designed for small language models.
![StructFix recovery flow](assets/structfix-flow.png)
## Problem
LLM and agent outputs often look almost correct but fail validation.
Input:
```json
{
  "priority": "urgent"
}
```
Constraint:
```text
priority must be one of: low | medium | high
```
Output:
```json
{
  "priority": "high"
}
```
## Quick Start
Install:
```bash
pip install transformers torch
```
Run inference:
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model_id = "ottema/structfix-codet5p-220m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
dsl = """FIELD priority TYPE string VALUES low|medium|high REQUIRED yes
FIELD description TYPE string REQUIRED yes"""
broken_output = """{
"priority": "urgent"
}"""
prompt = f"""TASK repair_structured_output
SPEC
{dsl}
BROKEN_OUTPUT
{broken_output}"""
inputs = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(
    **inputs,
    max_length=256,
    num_beams=1,
    do_sample=False,
)
repaired = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(repaired)
```
Example output:
```json
{"priority":"high","description":""}
```
## Developer Examples
### Reusable repair helper
```python
import json
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model_id = "ottema/structfix-codet5p-220m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
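def repair_structured_output(dsl: str, broken_output: str) -> dict:
    """Repair broken_output against a ConstraintDSL spec and parse the result."""
    prompt = f"""TASK repair_structured_output
SPEC
{dsl}
BROKEN_OUTPUT
{broken_output}"""
    # Prompts longer than 512 tokens are truncated.
    inputs = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True)
    # Greedy decoding keeps the repair deterministic.
    outputs = model.generate(
        **inputs,
        max_length=256,
        num_beams=1,
        do_sample=False,
    )
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Raises json.JSONDecodeError if the model output is not valid JSON.
    return json.loads(text)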
```
Usage:
```python
dsl = """FIELD status TYPE string VALUES success|error|pending REQUIRED yes
FIELD result TYPE string REQUIRED yes"""
payload = '{"result":"Found 3 items"}'
print(repair_structured_output(dsl, payload))
```
Example output:
```json
{"status":"success","result":"Found 3 items"}
```
### Repair and validate against JSON Schema
Use StructFix as a recovery step, then validate with your normal validator.
```bash
pip install transformers torch jsonschema
```
```python
import jsonschema
schema = {
    "type": "object",
    "properties": {
        "priority": {"type": "string", "enum": ["low", "medium", "high"]},
        "description": {"type": "string"},
    },
    "required": ["priority", "description"],
}
dsl = """FIELD priority TYPE string VALUES low|medium|high REQUIRED yes
FIELD description TYPE string REQUIRED yes"""
broken = '{"priority":"urgent"}'
repaired = repair_structured_output(dsl, broken)
jsonschema.validate(instance=repaired, schema=schema)
print(repaired)
```
Example output:
```json
{"priority":"high","description":""}
```
### Repair a Python-style tool call payload
```python
dsl = """TOOL create_ticket
ARG priority TYPE string VALUES low|medium|high REQUIRED yes
ARG description TYPE string REQUIRED yes
ARG customer_id TYPE integer REQUIRED no"""
broken_tool_call = """
create_ticket(priority="urgent", customer_id="42")
"""
print(repair_structured_output(dsl, broken_tool_call))
```
Example output:
```json
{"priority":"high","description":"","customer_id":42}
```
### Strip markdown and extra assistant text
````python
dsl = """FIELD user_id TYPE integer REQUIRED yes
FIELD username TYPE string REQUIRED yes
FIELD active TYPE boolean REQUIRED no"""
assistant_output = """Here is the JSON:
```json
{"user_id": "42", "username": "jdoe", "active": "true"}
```
"""
print(repair_structured_output(dsl, assistant_output))
````
Example output:
```json
{"user_id":42,"username":"jdoe","active":true}
```
### Compile JSON Schema to ConstraintDSL
This repository includes a reference compiler in `schema_compiler.py`. The core mapping is straightforward:
```python
def json_schema_to_dsl(schema: dict) -> str:
    """Compile a flat JSON Schema object into ConstraintDSL FIELD lines."""
    required = set(schema.get("required", []))
    lines = []
    for name, prop in schema.get("properties", {}).items():
        typ = prop.get("type", "string")
        enum = ""
        if "enum" in prop:
            # str() keeps non-string enum values (e.g. integers) from breaking join.
            enum = " VALUES " + "|".join(str(v) for v in prop["enum"])
        req = "yes" if name in required else "no"
        lines.append(f"FIELD {name} TYPE {typ}{enum} REQUIRED {req}")
    return "\n".join(lines)
```
Example:
```python
schema = {
    "type": "object",
    "properties": {
        "priority": {"type": "string", "enum": ["low", "medium", "high"]},
        "description": {"type": "string"},
    },
    "required": ["priority", "description"],
}
print(json_schema_to_dsl(schema))
```
Output:
```text
FIELD priority TYPE string VALUES low|medium|high REQUIRED yes
FIELD description TYPE string REQUIRED yes
```
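The same mapping can plausibly be extended to the `TOOL`/`ARG` form shown in the tool-call examples. The sketch below assumes a function-style tool definition with a `name` and a JSON Schema `parameters` object; `tool_to_dsl` is a hypothetical helper, and whether `schema_compiler.py` handles tools this way is not confirmed here.

```python
def tool_to_dsl(tool: dict) -> str:
    """Compile a function-style tool definition into TOOL/ARG lines."""
    params = tool.get("parameters", {})
    required = set(params.get("required", []))
    lines = [f"TOOL {tool['name']}"]
    for name, prop in params.get("properties", {}).items():
        typ = prop.get("type", "string")
        values = ""
        if "enum" in prop:
            values = " VALUES " + "|".join(str(v) for v in prop["enum"])
        req = "yes" if name in required else "no"
        lines.append(f"ARG {name} TYPE {typ}{values} REQUIRED {req}")
    return "\n".join(lines)


# Assumed input shape: {"name": ..., "parameters": <JSON Schema object>}.
create_ticket = {
    "name": "create_ticket",
    "parameters": {
        "type": "object",
        "properties": {
            "priority": {"type": "string", "enum": ["low", "medium", "high"]},
            "description": {"type": "string"},
            "customer_id": {"type": "integer"},
        },
        "required": ["priority", "description"],
    },
}
print(tool_to_dsl(create_ticket))
```

For this input the printed lines match the `create_ticket` tool-call DSL used elsewhere in this card.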
## What It Repairs
| Category | Support |
| --- | :---: |
| Missing required fields | Yes |
| Invalid enums | Yes |
| Wrong types | Yes |
| Partial tool calls | Yes |
| Markdown-wrapped JSON | Yes |
| Extra text before or after JSON | Yes |
| Truncated objects and arrays | Yes |
| Python-like tool calls | Yes |
## When To Use It
Use StructFix when you have a schema or tool definition and need to recover a structured payload from an LLM, agent, ETL, or integration workflow.
Good fits:
- Agent tool-call argument repair
- JSON payload recovery before validation
- Enum and required-field correction
- Recovering JSON from assistant responses with prose or markdown
- Lightweight local repair before retrying an expensive model call
Not a good fit:
- Arbitrary data cleaning without a schema
- High-stakes financial, medical, legal, or regulatory corrections without human validation
- Inputs longer than the model context window
- Tasks that require every original field name to be preserved, unless you check the output keys after repair
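On the context-window point: the Quick Start call passes `truncation=True`, so an oversized prompt is cut silently at 512 tokens, which can remove the end of the `BROKEN_OUTPUT` section. A minimal pre-check is sketched below, assuming a Hugging Face tokenizer that returns `input_ids` when called on a string. `fits_input_limit` is a hypothetical helper, not part of this repository.

```python
def fits_input_limit(tokenizer, prompt: str, max_tokens: int = 512) -> bool:
    """Return True when the tokenized prompt is within max_tokens.

    `tokenizer` is assumed to be callable on a string and to return a
    mapping with an "input_ids" list, as Hugging Face tokenizers do.
    """
    n_tokens = len(tokenizer(prompt)["input_ids"])
    return n_tokens <= max_tokens
```

Calling `fits_input_limit(tokenizer, prompt)` before `model.generate` lets you skip or split inputs that would otherwise be truncated.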
## ConstraintDSL
StructFix does not use raw JSON Schema directly at inference time. It expects a compact line-oriented schema format called ConstraintDSL.
Example:
```text
FIELD priority TYPE string VALUES low|medium|high REQUIRED yes
FIELD description TYPE string REQUIRED yes
FIELD customer_id TYPE integer REQUIRED no
```
Tool-call example:
```text
TOOL create_ticket
ARG priority TYPE string VALUES low|medium|high REQUIRED yes
ARG description TYPE string REQUIRED yes
ARG customer_id TYPE integer REQUIRED no
```
Model input format:
```text
TASK repair_structured_output
SPEC
FIELD priority TYPE string VALUES low|medium|high REQUIRED yes
FIELD description TYPE string REQUIRED yes
BROKEN_OUTPUT
{"priority":"urgent"}
```
ConstraintDSL exists because raw JSON Schema generalized poorly in this setup. With the same base model, data, and training procedure, switching to ConstraintDSL raised schema success on unseen schemas from **55.0%** to **96.3%**.
See [ConstraintDSL](https://huggingface.co/datasets/ottema/constraint-dsl) for the DSL specification and compiler references.
## Results
### Main benchmark
| Method | Schema Success |
| --- | :---: |
| json-repair | 65.2% |
| CodeT5+ + raw JSON Schema | 55.0% |
| **StructFix + ConstraintDSL** | **96.3%** |
| **StructFix + randomized fields** | **91.9%** |
### Schema representation ablation
| Condition | Schema Success |
| --- | :---: |
| Raw JSON Schema | 55.0% |
| ConstraintDSL | 96.3% |
| ConstraintDSL, randomized field names | 91.9% |
### Per-corruption performance
Unseen schemas with random hex field names:
| Corruption | StructFix | json-repair |
| --- | :---: | :---: |
| `invalid_enum` | 96.4% | 0% |
| `missing_required` | 92.2% | 0% |
| `null_required` | 97.9% | 2.9% |
| `wrong_type` | 92.0% | 0% |
| `tool_call_partial_args` | 90.9% | 0% |
| `tool_call_python_syntax` | 90.0% | 0% |
| `tool_call_wrong_param` | 93.8% | 51.2% |
| `agent_chain` | 87.2% | 40.5% |
Latency in the benchmark was about **690 ms/example** for StructFix and **0.13 ms/example** for json-repair.
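Given that latency gap, one option is to call the model only when a cheap check fails. The sketch below is one possible arrangement, not the benchmark setup: `recover`, `is_valid`, and `repair` are hypothetical names, with `is_valid` standing in for your schema validator and `repair` for a wrapper around the model call.

```python
import json
from typing import Callable, Optional


def recover(
    raw: str,
    is_valid: Callable[[dict], bool],
    repair: Callable[[str], dict],
) -> Optional[dict]:
    """Return a valid payload, calling the slow repair step only when needed."""
    # Fast path: the raw text already parses and passes validation.
    try:
        parsed = json.loads(raw)
        if isinstance(parsed, dict) and is_valid(parsed):
            return parsed
    except json.JSONDecodeError:
        pass

    # Slow path: model-based repair, then validate the result again.
    try:
        repaired = repair(raw)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
    return repaired if is_valid(repaired) else None
```

With the helper from the Developer Examples section, `repair` could be `lambda raw: repair_structured_output(dsl, raw)`. A `None` result marks the point where a caller would fall back to retrying the upstream model.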
## Known Limitations
- Field names unseen during training may be replaced in the output with similar-looking or semantically related names.
- Mapping an invalid enum value to a valid one (for example `urgent` to `high`) depends on lexical similarity and field-name semantics.
- The model is English-oriented in the current version.
- Maximum input length is 512 tokens.
- Always validate the output against your schema after inference.
- Not recommended for financial, medical, legal, or regulatory corrections without human review.
Example field-name substitutions observed in showcase validation:
| DSL field name | Model output |
| --- | --- |
| `action` | `active` |
| `records_processed` | `items_processed` |
| `contract_id` | `consign_id` |
| `to` | `strand` |
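Substitutions like these can be caught by comparing the output keys with the names declared in the DSL. The sketch below assumes flat payloads and whitespace-separated `FIELD`/`ARG` lines as shown in this card; `dsl_field_names` and `unexpected_keys` are hypothetical helpers, not part of this repository.

```python
def dsl_field_names(dsl: str) -> set:
    """Collect the names declared on FIELD and ARG lines."""
    names = set()
    for line in dsl.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] in ("FIELD", "ARG"):
            names.add(parts[1])
    return names


def unexpected_keys(dsl: str, payload: dict) -> set:
    """Return payload keys that the DSL does not declare."""
    return set(payload) - dsl_field_names(dsl)


dsl = "FIELD action TYPE string REQUIRED yes\nFIELD to TYPE string REQUIRED yes"
print(unexpected_keys(dsl, {"active": "send", "to": "ops"}))  # {'active'}
```

A non-empty result means the model renamed or invented a key, so the payload should be rejected or retried rather than passed downstream.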
## Research Findings
- Raw JSON Schema generalized poorly for this 220M model: **55.0%** schema success.
- ConstraintDSL improved unseen-schema performance to **96.3%**.
- Randomized field names still achieved **91.9%**, suggesting the model uses explicit constraints rather than only memorized field semantics.
- Field names remain the most important DSL component in ablations.
Full benchmark details are available in [StructFix-Bench](https://huggingface.co/datasets/ottema/structfix-bench).
## Training Details
| Item | Value |
| --- | --- |
| Base model | `Salesforce/codet5p-220m` |
| Parameters | 220M |
| Training data | 200K synthetic examples |
| Format | ConstraintDSL |
| Epochs | 3 |
| Effective batch size | 32 |
| Learning rate | 2e-4 |
| Final eval loss | 0.056 |
| Field-name shuffling | 50% of training examples |
| Synthetic enums | 50% of training examples |
## Related Repositories
- [StructFix model](https://huggingface.co/ottema/structfix-codet5p-220m): this model card, focused on usage.
- [StructFix-Bench](https://huggingface.co/datasets/ottema/structfix-bench): dataset, benchmark splits, and ablations.
- [ConstraintDSL](https://huggingface.co/datasets/ottema/constraint-dsl): DSL specification and compiler references.
## Citation
```bibtex
@software{structfix_codet5p_220m,
title = {StructFix: Schema-Aware Structured Output Recovery with ConstraintDSL},
author = {Ottema},
year = {2026},
url = {https://huggingface.co/ottema/structfix-codet5p-220m}
}
```
## License
Apache-2.0. Check the Salesforce CodeT5+ base model license for compatibility with your use case.