---
license: apache-2.0
library_name: transformers
tags:
- json-repair
- schema-validation
- structured-output
- tool-calling
- agent-workflows
- code-t5
- constraint-dsl
base_model: Salesforce/codet5p-220m
pipeline_tag: text-generation
---

# StructFix

Schema-aware structured output recovery for LLMs and agent workflows.

- Recovers invalid structured outputs
- Repairs missing required fields
- Fixes enum violations
- Validates and repairs tool-call payloads
- Handles markdown-wrapped or text-wrapped JSON
- Lightweight: 220M parameters

**91.9% schema success on unseen schemas with randomized field names.**

StructFix is a CodeT5+ 220M model fine-tuned to repair broken structured outputs using **ConstraintDSL**, a compact schema representation designed for small language models.

![StructFix recovery flow](assets/structfix-flow.png)

## Problem

LLM and agent outputs often look almost correct but fail validation.

Input:

```json
{ "priority": "urgent" }
```

Constraint:

```text
priority must be one of: low | medium | high
```

Output:

```json
{ "priority": "high" }
```

## Quick Start

Install:

```bash
pip install transformers torch
```

Run inference:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "ottema/structfix-codet5p-220m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

dsl = """FIELD priority TYPE string VALUES low|medium|high REQUIRED yes
FIELD description TYPE string REQUIRED yes"""

broken_output = """{ "priority": "urgent" }"""

prompt = f"""TASK repair_structured_output

SPEC
{dsl}

BROKEN_OUTPUT
{broken_output}"""

inputs = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(
    **inputs,
    max_length=256,
    num_beams=1,
    do_sample=False,
)

repaired = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(repaired)
```

Example output:

```json
{"priority":"high","description":""}
```

## Developer Examples

### Reusable repair helper

```python
import json

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "ottema/structfix-codet5p-220m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)


def repair_structured_output(dsl: str, broken_output: str) -> dict:
    prompt = f"""TASK repair_structured_output

SPEC
{dsl}

BROKEN_OUTPUT
{broken_output}"""

    inputs = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(
        **inputs,
        max_length=256,
        num_beams=1,
        do_sample=False,
    )
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return json.loads(text)
```

Usage:

```python
dsl = """FIELD status TYPE string VALUES success|error|pending REQUIRED yes
FIELD result TYPE string REQUIRED yes"""

payload = '{"result":"Found 3 items"}'

print(repair_structured_output(dsl, payload))
```

Example output:

```json
{"status":"success","result":"Found 3 items"}
```

### Repair and validate against JSON Schema

Use StructFix as a recovery step, then validate with your normal validator.

```bash
pip install transformers torch jsonschema
```

```python
import jsonschema

schema = {
    "type": "object",
    "properties": {
        "priority": {"type": "string", "enum": ["low", "medium", "high"]},
        "description": {"type": "string"},
    },
    "required": ["priority", "description"],
}

dsl = """FIELD priority TYPE string VALUES low|medium|high REQUIRED yes
FIELD description TYPE string REQUIRED yes"""

broken = '{"priority":"urgent"}'

repaired = repair_structured_output(dsl, broken)
jsonschema.validate(instance=repaired, schema=schema)
print(repaired)
```

Example output:

```json
{"priority":"high","description":""}
```

### Repair an OpenAI-style tool call payload

```python
dsl = """TOOL create_ticket
ARG priority TYPE string VALUES low|medium|high REQUIRED yes
ARG description TYPE string REQUIRED yes
ARG customer_id TYPE integer REQUIRED no"""

broken_tool_call = """
create_ticket(priority="urgent", customer_id="42")
"""

print(repair_structured_output(dsl, broken_tool_call))
```

Example output:

```json
{"priority":"high","description":"","customer_id":42}
```

### Strip markdown and extra assistant text

````python
dsl = """FIELD user_id TYPE integer REQUIRED yes
FIELD username TYPE string REQUIRED yes
FIELD active TYPE boolean REQUIRED no"""

assistant_output = """Here is the JSON:

```json
{"user_id": "42", "username": "jdoe", "active": "true"}
```
"""

print(repair_structured_output(dsl, assistant_output))
````

Example output:

```json
{"user_id":42,"username":"jdoe","active":true}
```

### Compile JSON Schema to ConstraintDSL

This repository includes a reference compiler in `schema_compiler.py`.
The core mapping is straightforward:

```python
def json_schema_to_dsl(schema: dict) -> str:
    required = set(schema.get("required", []))
    lines = []

    for name, prop in schema.get("properties", {}).items():
        typ = prop.get("type", "string")
        enum = ""
        if "enum" in prop:
            enum = " VALUES " + "|".join(prop["enum"])
        req = "yes" if name in required else "no"
        lines.append(f"FIELD {name} TYPE {typ}{enum} REQUIRED {req}")

    return "\n".join(lines)
```

Example:

```python
schema = {
    "type": "object",
    "properties": {
        "priority": {"type": "string", "enum": ["low", "medium", "high"]},
        "description": {"type": "string"},
    },
    "required": ["priority", "description"],
}

print(json_schema_to_dsl(schema))
```

Output:

```text
FIELD priority TYPE string VALUES low|medium|high REQUIRED yes
FIELD description TYPE string REQUIRED yes
```

## What It Repairs

| Category | Support |
| --- | :---: |
| Missing required fields | Yes |
| Invalid enums | Yes |
| Wrong types | Yes |
| Partial tool calls | Yes |
| Markdown-wrapped JSON | Yes |
| Extra text before or after JSON | Yes |
| Truncated objects and arrays | Yes |
| Python-like tool calls | Yes |

## When To Use It

Use StructFix when you have a schema or tool definition and need to recover a structured payload from an LLM, agent, ETL, or integration workflow.

Good fits:

- Agent tool-call argument repair
- JSON payload recovery before validation
- Enum and required-field correction
- Recovering JSON from assistant responses with prose or markdown
- Lightweight local repair before retrying an expensive model call

Not a good fit:

- Arbitrary data cleaning without a schema
- High-stakes financial, medical, legal, or regulatory corrections without human validation
- Inputs longer than the model context window
- Tasks where preserving every original field name is mandatory without post-validation

## ConstraintDSL

StructFix does not use raw JSON Schema directly at inference time. It expects a compact line-oriented schema format called ConstraintDSL.

Example:

```text
FIELD priority TYPE string VALUES low|medium|high REQUIRED yes
FIELD description TYPE string REQUIRED yes
FIELD customer_id TYPE integer REQUIRED no
```

Tool-call example:

```text
TOOL create_ticket
ARG priority TYPE string VALUES low|medium|high REQUIRED yes
ARG description TYPE string REQUIRED yes
ARG customer_id TYPE integer REQUIRED no
```

Model input format:

```text
TASK repair_structured_output

SPEC
FIELD priority TYPE string VALUES low|medium|high REQUIRED yes
FIELD description TYPE string REQUIRED yes

BROKEN_OUTPUT
{"priority":"urgent"}
```

ConstraintDSL exists because raw JSON Schema generalized poorly in this setup. With the same base model, data, and training procedure, ConstraintDSL raised schema success on unseen schemas from **55.0%** to **96.3%**.

See [ConstraintDSL](https://huggingface.co/datasets/ottema/constraint-dsl) for the DSL specification and compiler references.

## Results

### Main benchmark

| Method | Schema Success |
| --- | :---: |
| json-repair | 65.2% |
| CodeT5+ + raw JSON Schema | 55.0% |
| **StructFix + ConstraintDSL** | **96.3%** |
| **StructFix + randomized fields** | **91.9%** |

### Schema representation ablation

| Test | Schema Success |
| --- | :---: |
| Raw JSON Schema | 55.0% |
| ConstraintDSL | 96.3% |
| Randomized field names | 91.9% |

### Per-corruption performance

Unseen schemas with random hex field names:

| Corruption | StructFix | json-repair |
| --- | :---: | :---: |
| `invalid_enum` | 96.4% | 0% |
| `missing_required` | 92.2% | 0% |
| `null_required` | 97.9% | 2.9% |
| `wrong_type` | 92.0% | 0% |
| `tool_call_partial_args` | 90.9% | 0% |
| `tool_call_python_syntax` | 90.0% | 0% |
| `tool_call_wrong_param` | 93.8% | 51.2% |
| `agent_chain` | 87.2% | 40.5% |

Latency in the benchmark was about **690 ms/example** for StructFix and **0.13 ms/example** for json-repair.

## Known Limitations

- Field names unseen during training may be substituted by semantically similar names.
- Synonym enum repair depends on lexical similarity and field-name semantics.
- The model is English-oriented in the current version.
- Maximum input length is 512 tokens.
- Always validate the output against your schema after inference.
- Not recommended for financial, medical, legal, or regulatory corrections without human review.

Example field-name substitutions observed in showcase validation:

| DSL field name | Model output |
| --- | --- |
| `action` | `active` |
| `records_processed` | `items_processed` |
| `contract_id` | `consign_id` |
| `to` | `strand` |

## Research Findings

- Raw JSON Schema generalized poorly for this 220M model: **55.0%** schema success.
- ConstraintDSL improved unseen-schema performance to **96.3%**.
- Randomized field names still achieved **91.9%**, suggesting the model uses explicit constraints rather than only memorized field semantics.
- Field names remain the most important DSL component in ablations.

Full benchmark details are available in [StructFix-Bench](https://huggingface.co/datasets/ottema/structfix-bench).

## Training Details

| Item | Value |
| --- | --- |
| Base model | `Salesforce/codet5p-220m` |
| Parameters | 220M |
| Training data | 200K synthetic examples |
| Format | ConstraintDSL |
| Epochs | 3 |
| Effective batch size | 32 |
| Learning rate | 2e-4 |
| Final eval loss | 0.056 |
| Field-name shuffling | 50% of training examples |
| Synthetic enums | 50% of training examples |

## Related Repositories

- [StructFix model](https://huggingface.co/ottema/structfix-codet5p-220m): this model card, focused on usage.
- [StructFix-Bench](https://huggingface.co/datasets/ottema/structfix-bench): dataset, benchmark splits, and ablations.
- [ConstraintDSL](https://huggingface.co/datasets/ottema/constraint-dsl): DSL specification and compiler references.
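
## Post-Validation Sketch

The Known Limitations above note that field names may be substituted and that output should always be validated after inference. As a minimal sketch of one such check, the helper below flags keys in a repaired payload that the DSL never declared. It assumes every ConstraintDSL declaration takes the form `FIELD <name> ...` or `ARG <name> ...`, as in the examples in this card. The function names `dsl_field_names` and `unexpected_keys` are hypothetical and are not part of this repository.

```python
def dsl_field_names(dsl: str) -> set:
    """Collect the names declared on FIELD or ARG lines of a ConstraintDSL spec.

    Assumption: each declaration looks like `FIELD <name> ...` or `ARG <name> ...`;
    other line kinds (for example TOOL) are skipped.
    """
    names = set()
    for line in dsl.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] in ("FIELD", "ARG"):
            names.add(parts[1])
    return names


def unexpected_keys(dsl: str, repaired: dict) -> set:
    """Return keys present in the repaired payload but never declared in the DSL."""
    return set(repaired) - dsl_field_names(dsl)


dsl = """FIELD action TYPE string REQUIRED yes
FIELD to TYPE string REQUIRED yes"""

# A payload showing one of the substitutions listed above (`action` -> `active`).
suspect = {"active": "archive", "to": "ops"}
print(unexpected_keys(dsl, suspect))  # {'active'}
```

A non-empty result is a reasonable signal to retry or reject the payload rather than pass it downstream. This check complements full schema validation and does not replace it.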

## Citation

```bibtex
@software{structfix_codet5p_220m,
  title = {StructFix: Schema-Aware Structured Output Recovery with ConstraintDSL},
  author = {Ottema},
  year = {2026},
  url = {https://huggingface.co/ottema/structfix-codet5p-220m}
}
```

## License

Apache-2.0. Check the Salesforce CodeT5+ base model license for compatibility with your use case.