---
license: apache-2.0
library_name: transformers
tags:
  - json-repair
  - schema-validation
  - structured-output
  - tool-calling
  - agent-workflows
  - code-t5
  - constraint-dsl
base_model: Salesforce/codet5p-220m
pipeline_tag: text2text-generation
---

# StructFix

Schema-aware structured output recovery for LLMs and agent workflows.

- Recovers invalid structured outputs
- Repairs missing required fields
- Fixes enum violations
- Validates and repairs tool-call payloads
- Handles markdown-wrapped or text-wrapped JSON
- Lightweight: 220M parameters

**91.9% schema success on unseen schemas with randomized field names.**

StructFix is a CodeT5+ 220M model fine-tuned to repair broken structured outputs using **ConstraintDSL**, a compact schema representation designed for small language models.

![StructFix recovery flow](assets/structfix-flow.png)

## Problem

LLM and agent outputs often look almost correct but fail validation.

Input:

```json
{
  "priority": "urgent"
}
```

Constraint:

```text
priority must be one of: low | medium | high
```

Output:

```json
{
  "priority": "high"
}
```

## Quick Start

Install:

```bash
pip install transformers torch
```

Run inference:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "ottema/structfix-codet5p-220m"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

dsl = """FIELD priority TYPE string VALUES low|medium|high REQUIRED yes
FIELD description TYPE string REQUIRED yes"""

broken_output = """{
  "priority": "urgent"
}"""

prompt = f"""TASK repair_structured_output

SPEC
{dsl}

BROKEN_OUTPUT
{broken_output}"""

inputs = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(
    **inputs,
    max_length=256,
    num_beams=1,
    do_sample=False,
)

repaired = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(repaired)
```

Example output:

```json
{"priority":"high","description":""}
```

## Developer Examples

### Reusable repair helper

```python
import json
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "ottema/structfix-codet5p-220m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)


def repair_structured_output(dsl: str, broken_output: str) -> dict:
    prompt = f"""TASK repair_structured_output

SPEC
{dsl}

BROKEN_OUTPUT
{broken_output}"""

    inputs = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(
        **inputs,
        max_length=256,
        num_beams=1,
        do_sample=False,
    )
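    text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # json.loads raises json.JSONDecodeError if the model emits malformed JSON;
    # callers should catch it and fall back (retry, reject, or escalate).
    return json.loads(text)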
```

Usage:

```python
dsl = """FIELD status TYPE string VALUES success|error|pending REQUIRED yes
FIELD result TYPE string REQUIRED yes"""

payload = '{"result":"Found 3 items"}'

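# The helper returns a dict; serialize it to print compact JSON rather than a Python repr.
print(json.dumps(repair_structured_output(dsl, payload), separators=(",", ":")))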
```

Example output:

```json
{"status":"success","result":"Found 3 items"}
```

### Repair and validate against JSON Schema

Use StructFix as a recovery step, then validate with your normal validator.

```bash
pip install transformers torch jsonschema
```

```python
import jsonschema

schema = {
    "type": "object",
    "properties": {
        "priority": {"type": "string", "enum": ["low", "medium", "high"]},
        "description": {"type": "string"},
    },
    "required": ["priority", "description"],
}

dsl = """FIELD priority TYPE string VALUES low|medium|high REQUIRED yes
FIELD description TYPE string REQUIRED yes"""

broken = '{"priority":"urgent"}'
repaired = repair_structured_output(dsl, broken)

jsonschema.validate(instance=repaired, schema=schema)
# `json` is already imported alongside the helper above.
print(json.dumps(repaired, separators=(",", ":")))
```

Example output:

```json
{"priority":"high","description":""}
```
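
One way to wire this into a pipeline is to validate first and call the model only when validation fails, so that already-valid payloads skip the model call. The sketch below is an illustration, not part of this repository: `recover`, `validate`, and `repair` are hypothetical names, and both callables are injected so the function does not depend on a particular validator or model.

```python
import json
from typing import Callable


def recover(
    raw: str,
    validate: Callable[[dict], None],
    repair: Callable[[str], dict],
) -> tuple[dict, bool]:
    """Return (payload, was_repaired).

    `validate` must raise on an invalid payload; `repair` is the
    model-backed fallback. Both are assumptions supplied by the caller.
    """
    try:
        payload = json.loads(raw)
        validate(payload)
        return payload, False  # already valid: skip the model call
    except Exception:
        # Broad on purpose: parsers and validators raise different
        # exception types, and any failure sends us to the repair path.
        pass

    payload = repair(raw)
    validate(payload)  # re-check: the repaired payload is not trusted blindly
    return payload, True
```

With the objects from the example above, `validate` could be `lambda p: jsonschema.validate(instance=p, schema=schema)` and `repair` could be `lambda text: repair_structured_output(dsl, text)`. If the second `validate` call raises, the repair itself failed and the caller should retry or reject the payload.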

### Repair an OpenAI-style tool call payload

```python
dsl = """TOOL create_ticket
ARG priority TYPE string VALUES low|medium|high REQUIRED yes
ARG description TYPE string REQUIRED yes
ARG customer_id TYPE integer REQUIRED no"""

broken_tool_call = """
create_ticket(priority="urgent", customer_id="42")
"""

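# Serialize the returned dict so the printed text is JSON, not a Python repr.
print(json.dumps(repair_structured_output(dsl, broken_tool_call), separators=(",", ":")))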
```

Example output:

```json
{"priority":"high","description":"","customer_id":42}
```

### Strip markdown and extra assistant text

````python
dsl = """FIELD user_id TYPE integer REQUIRED yes
FIELD username TYPE string REQUIRED yes
FIELD active TYPE boolean REQUIRED no"""

assistant_output = """Here is the JSON:

```json
{"user_id": "42", "username": "jdoe", "active": "true"}
```
"""

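# Serialize the returned dict so booleans print as JSON `true`, not Python `True`.
print(json.dumps(repair_structured_output(dsl, assistant_output), separators=(",", ":")))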
````

Example output:

```json
{"user_id":42,"username":"jdoe","active":true}
```

### Compile JSON Schema to ConstraintDSL

This repository includes a reference compiler in `schema_compiler.py`. For flat object schemas, the core mapping is straightforward:

```python
def json_schema_to_dsl(schema: dict) -> str:
    required = set(schema.get("required", []))
    lines = []

    for name, prop in schema.get("properties", {}).items():
        typ = prop.get("type", "string")
        enum = ""
        if "enum" in prop:
            # Enum members may be non-strings (e.g. integers), so stringify before joining.
            enum = " VALUES " + "|".join(str(v) for v in prop["enum"])
        req = "yes" if name in required else "no"
        lines.append(f"FIELD {name} TYPE {typ}{enum} REQUIRED {req}")

    return "\n".join(lines)
```

Example:

```python
schema = {
    "type": "object",
    "properties": {
        "priority": {"type": "string", "enum": ["low", "medium", "high"]},
        "description": {"type": "string"},
    },
    "required": ["priority", "description"],
}

print(json_schema_to_dsl(schema))
```

Output:

```text
FIELD priority TYPE string VALUES low|medium|high REQUIRED yes
FIELD description TYPE string REQUIRED yes
```
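
The same mapping can be extended to tool definitions. The sketch below assumes an OpenAI-style function tool (a `name` plus a JSON Schema under `parameters`, optionally wrapped in a `function` key) and emits the `TOOL`/`ARG` lines used in this card. `tool_to_dsl` is a hypothetical helper for illustration; the reference compiler may handle this differently.

```python
def tool_to_dsl(tool: dict) -> str:
    """Compile an OpenAI-style function tool definition to TOOL/ARG lines."""
    fn = tool.get("function", tool)  # accept wrapped or bare definitions
    params = fn.get("parameters", {})
    required = set(params.get("required", []))

    lines = [f"TOOL {fn['name']}"]
    for name, prop in params.get("properties", {}).items():
        typ = prop.get("type", "string")
        values = ""
        if "enum" in prop:
            values = " VALUES " + "|".join(str(v) for v in prop["enum"])
        req = "yes" if name in required else "no"
        lines.append(f"ARG {name} TYPE {typ}{values} REQUIRED {req}")
    return "\n".join(lines)


tool = {
    "type": "function",
    "function": {
        "name": "create_ticket",
        "parameters": {
            "type": "object",
            "properties": {
                "priority": {"type": "string", "enum": ["low", "medium", "high"]},
                "description": {"type": "string"},
                "customer_id": {"type": "integer"},
            },
            "required": ["priority", "description"],
        },
    },
}

print(tool_to_dsl(tool))
```

For this definition the printed lines match the `create_ticket` spec used in the tool-call example earlier in this card. Nested objects and arrays are not handled by this sketch.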

## What It Repairs

| Category | Support |
| --- | :---: |
| Missing required fields | Yes |
| Invalid enums | Yes |
| Wrong types | Yes |
| Partial tool calls | Yes |
| Markdown-wrapped JSON | Yes |
| Extra text before or after JSON | Yes |
| Truncated objects and arrays | Yes |
| Python-like tool calls | Yes |

## When To Use It

Use StructFix when you have a schema or tool definition and need to recover a structured payload from an LLM, agent, ETL, or integration workflow.

Good fits:

- Agent tool-call argument repair
- JSON payload recovery before validation
- Enum and required-field correction
- Recovering JSON from assistant responses with prose or markdown
- Lightweight local repair before retrying an expensive model call

Not a good fit:

- Arbitrary data cleaning without a schema
- High-stakes financial, medical, legal, or regulatory corrections without human validation
- Inputs longer than the model's 512-token input limit
- Tasks that must preserve every original field name exactly, unless you verify field names after repair

## ConstraintDSL

StructFix does not use raw JSON Schema directly at inference time. It expects a compact line-oriented schema format called ConstraintDSL.

Example:

```text
FIELD priority TYPE string VALUES low|medium|high REQUIRED yes
FIELD description TYPE string REQUIRED yes
FIELD customer_id TYPE integer REQUIRED no
```

Tool-call example:

```text
TOOL create_ticket
ARG priority TYPE string VALUES low|medium|high REQUIRED yes
ARG description TYPE string REQUIRED yes
ARG customer_id TYPE integer REQUIRED no
```
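
The line format above is regular enough to parse back into a structure, which can be handy for checking a repaired payload against the same spec the model saw. A minimal sketch, assuming the grammar consists only of the keywords that appear in this card (`TOOL`, `FIELD`, `ARG`, `TYPE`, `VALUES`, `REQUIRED`) and that names and enum values contain no whitespace; the full specification may define more. `parse_constraint_dsl` is a hypothetical helper, not the reference compiler.

```python
def parse_constraint_dsl(dsl: str) -> dict:
    """Parse ConstraintDSL text into {"tool": name_or_None, "fields": [...]}."""
    spec = {"tool": None, "fields": []}
    for raw_line in dsl.splitlines():
        tokens = raw_line.split()
        if not tokens:
            continue  # skip blank lines
        if tokens[0] == "TOOL" and len(tokens) == 2:
            spec["tool"] = tokens[1]
            continue
        # A field line is: FIELD|ARG <name> followed by KEY value pairs,
        # so it always has an even number of tokens.
        if tokens[0] not in ("FIELD", "ARG") or len(tokens) < 2 or len(tokens) % 2:
            raise ValueError(f"unrecognized line: {raw_line!r}")
        attrs = dict(zip(tokens[2::2], tokens[3::2]))
        unknown = set(attrs) - {"TYPE", "VALUES", "REQUIRED"}
        if unknown:
            raise ValueError(f"unknown keys {sorted(unknown)} in: {raw_line!r}")
        spec["fields"].append({
            "name": tokens[1],
            "type": attrs.get("TYPE", "string"),
            "values": attrs["VALUES"].split("|") if "VALUES" in attrs else None,
            "required": attrs.get("REQUIRED") == "yes",
        })
    return spec
```

Raising on unrecognized lines is a deliberate choice here: a silently skipped constraint would make later validation look stricter than it is.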

Model input format:

```text
TASK repair_structured_output

SPEC
FIELD priority TYPE string VALUES low|medium|high REQUIRED yes
FIELD description TYPE string REQUIRED yes

BROKEN_OUTPUT
{"priority":"urgent"}
```

ConstraintDSL exists because raw JSON Schema generalized poorly in this setup. With the same base model, data, and training procedure, switching to ConstraintDSL raised schema success on unseen schemas from **55.0%** to **96.3%**.

See [ConstraintDSL](https://huggingface.co/datasets/ottema/constraint-dsl) for the DSL specification and compiler references.

## Results

### Main benchmark

| Method | Schema Success |
| --- | :---: |
| json-repair | 65.2% |
| CodeT5+ + raw JSON Schema | 55.0% |
| **StructFix + ConstraintDSL** | **96.3%** |
| **StructFix + randomized fields** | **91.9%** |

### Schema representation ablation

| Test | Schema Success |
| --- | :---: |
| Raw JSON Schema | 55.0% |
| ConstraintDSL | 96.3% |
| Randomized field names | 91.9% |

### Per-corruption performance

Unseen schemas with random hex field names:

| Corruption | StructFix | json-repair |
| --- | :---: | :---: |
| `invalid_enum` | 96.4% | 0% |
| `missing_required` | 92.2% | 0% |
| `null_required` | 97.9% | 2.9% |
| `wrong_type` | 92.0% | 0% |
| `tool_call_partial_args` | 90.9% | 0% |
| `tool_call_python_syntax` | 90.0% | 0% |
| `tool_call_wrong_param` | 93.8% | 51.2% |
| `agent_chain` | 87.2% | 40.5% |

Latency in the benchmark was about **690 ms/example** for StructFix and **0.13 ms/example** for json-repair, so StructFix is roughly 5,000 times slower per example.

## Known Limitations

- Field names unseen during training may be replaced with semantically or lexically similar names.
- Synonym enum repair depends on lexical similarity and field-name semantics.
- The model is English-oriented in the current version.
- Maximum input length is 512 tokens.
- Always validate the output against your schema after inference.
- Not recommended for financial, medical, legal, or regulatory corrections without human review.
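
Because the examples in this card tokenize with `truncation=True`, a prompt over the limit is cut off silently, and since `BROKEN_OUTPUT` comes last in the prompt it is the payload that gets clipped. A small pre-check can surface this instead. The sketch below assumes a Hugging Face tokenizer (any callable returning a mapping with `input_ids` works); `check_prompt_budget` is a hypothetical name.

```python
def check_prompt_budget(tokenizer, prompt: str, limit: int = 512) -> int:
    """Return the prompt's token count, raising if it exceeds `limit`."""
    # Tokenize without truncation so the true length is measured.
    n_tokens = len(tokenizer(prompt)["input_ids"])
    if n_tokens > limit:
        raise ValueError(
            f"prompt is {n_tokens} tokens; limit is {limit}. "
            "Shorten the spec or split the payload before repairing."
        )
    return n_tokens
```

Calling this on the assembled prompt before `model.generate` turns a silent truncation into an explicit error that the caller can handle.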

Example field-name substitutions observed in showcase validation:

| DSL field name | Model output |
| --- | --- |
| `action` | `active` |
| `records_processed` | `items_processed` |
| `contract_id` | `consign_id` |
| `to` | `strand` |
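
Substitutions like these can be detected after inference by comparing the output keys with the names declared in the DSL. The sketch below goes one step further and maps a stray key back to the closest missing declared name using `difflib` string similarity. This is a heuristic, not something the model or this repository provides: `restore_field_names` and the `cutoff` default are assumptions, and a dissimilar pair such as `to` and `strand` will not be matched.

```python
import difflib


def restore_field_names(
    declared: list[str], repaired: dict, cutoff: float = 0.6
) -> tuple[dict, list[str]]:
    """Return (fixed_payload, unresolved_keys).

    Keys already declared are kept. Any other key is renamed to the most
    similar declared name that is still missing, if one clears `cutoff`.
    """
    declared_set = set(declared)
    missing = [name for name in declared if name not in repaired]
    fixed: dict = {}
    unresolved: list[str] = []

    for key, value in repaired.items():
        if key in declared_set:
            fixed[key] = value
            continue
        match = difflib.get_close_matches(key, missing, n=1, cutoff=cutoff)
        if match:
            fixed[match[0]] = value
            missing.remove(match[0])  # each declared name is used at most once
        else:
            unresolved.append(key)  # needs a retry or human review
    return fixed, unresolved
```

A non-empty `unresolved` list is a signal to reject or retry rather than pass the payload downstream. Even when every key is resolved, the result should still go through schema validation.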

## Research Findings

- Raw JSON Schema generalized poorly for this 220M model: **55.0%** schema success.
- ConstraintDSL improved unseen-schema performance to **96.3%**.
- Randomized field names still achieved **91.9%**, suggesting the model uses explicit constraints rather than only memorized field semantics.
- Field names remain the most important DSL component in ablations.

Full benchmark details are available in [StructFix-Bench](https://huggingface.co/datasets/ottema/structfix-bench).

## Training Details

| Item | Value |
| --- | --- |
| Base model | `Salesforce/codet5p-220m` |
| Parameters | 220M |
| Training data | 200K synthetic examples |
| Format | ConstraintDSL |
| Epochs | 3 |
| Effective batch size | 32 |
| Learning rate | 2e-4 |
| Final eval loss | 0.056 |
| Field-name shuffling | 50% of training examples |
| Synthetic enums | 50% of training examples |

## Related Repositories

- [StructFix model](https://huggingface.co/ottema/structfix-codet5p-220m): this model card, focused on usage.
- [StructFix-Bench](https://huggingface.co/datasets/ottema/structfix-bench): dataset, benchmark splits, and ablations.
- [ConstraintDSL](https://huggingface.co/datasets/ottema/constraint-dsl): DSL specification and compiler references.

## Citation

```bibtex
@software{structfix_codet5p_220m,
  title = {StructFix: Schema-Aware Structured Output Recovery with ConstraintDSL},
  author = {Ottema},
  year = {2026},
  url = {https://huggingface.co/ottema/structfix-codet5p-220m}
}
```

## License

Apache-2.0. Check the Salesforce CodeT5+ base model license for compatibility with your use case.