Teaching a 1B Model to Speak Puppet JSON

Community Article
Published June 14, 2026

Fine-tuning MiniCPM5-1B into a narrow Actor model for AI Puppet Theater.

This is a technical follow-up to AI Puppet Theater: From Premise to Puppet Show. The app is live on Hugging Face Spaces here: AI Puppet Theater Space.

In AI Puppet Theater, a user enters a short premise and the app turns it into a tiny puppet show. A Director agent orchestrates the scene, while Actor agents perform one beat at a time as puppet characters. Since this was built for the Build Small Hackathon, I wanted to see how far a small, specialized model could go when the role was narrow enough.

Part of the goal was personal learning too. I had tried fine-tuning before, but this was the first time I went through the full loop for my own app: generating the dataset, training the adapter, evaluating structured outputs, merging the model, converting it to GGUF, and testing local inference.

Why fine-tune the Actor?

The Actor in AI Puppet Theater has a narrow job: look at the current show state, follow the Director’s instruction, and return one short, theatrical JSON object for the next beat. It does not need to be a general assistant. It only needs to speak in character, update a small amount of memory, optionally request a theatrical tool, and stay inside a strict response schema.

Prompting helped, but it was not enough by itself. The app expected fields like intent, line, emotion, gesture, stage_effect, memory_update, and tool_request. When the model returned invalid JSON, missed a required key, added extra fields, or produced an invalid tool request, the runtime had to repair the response or fall back to deterministic behavior. That kept the show running, but it also made model-backed performance less reliable and less fluid.

That made the Actor a good fine-tuning target. The task was narrow, repeated, and easy to evaluate: given a puppet-show state, return one valid Actor JSON object. Instead of fine-tuning a model to become broadly smarter, the goal was to make a small model more dependable at one specific behavior inside the app.

The target output: one stage-ready JSON object

The target output for the Actor model was intentionally small. For each beat, the model had to return a single JSON object that the app could validate and apply to the stage.

{
  "intent": "react_to_event",
  "line": "I did not touch the moon-cheese lever... but I may have named it.",
  "emotion": "nervous",
  "gesture": "hides a tiny wrench behind their back",
  "stage_effect": "spotlight_flicker",
  "memory_update": "Pip is nervous about being blamed for the moon-cheese machine.",
  "tool_request": null
}

Each field had a job in the app. line was the spoken puppet line. emotion and gesture helped update the puppet card and performance style. stage_effect gave the stage something visual to react to. memory_update let the Actor carry a small amount of state across beats. tool_request allowed the Actor to ask for a theatrical tool, such as inspecting a prop, consulting the stage oracle, or changing the lighting.

This schema also made the model easier to evaluate. A response was not just “good” or “bad” in a vague way. I could check whether it parsed as JSON, whether it had the required keys, whether the line was short enough to speak, whether the tool request was valid, and whether the output could be safely used by the runtime.

Building the Actor SFT dataset

For the first version, I generated a synthetic Actor SFT dataset specifically for this app. Each example was framed around the same runtime pattern: there is a show state, an Actor state, a Director instruction, and the expected output is one valid Actor JSON object.

The dataset was not trying to teach general reasoning or open-ended chat. It was trying to teach the model the shape of the performance. The Actor should write a short, speakable line, stay in character, update memory when useful, and only request tools in the format the app understands.

The first dataset version had 1,400 examples split into train and validation JSONL files. I kept the format close to the app’s real prompts so the training task matched runtime behavior as much as possible. I also kept strict JSON behavior as a first-class requirement because the app could not safely use loose prose.

Training MiniCPM5-1B with LoRA

For the base model, I used openbmb/MiniCPM5-1B. It fit the Build Small Hackathon constraint and was small enough to experiment with while still being capable enough for a narrow structured-output role.

The first training target was a LoRA adapter, not a fully merged model. I trained the adapter on Modal using a CUDA GPU environment, with the Actor SFT train and validation files as input. The goal was to get a small adapter that could make the base model more reliable for this one task without needing to retrain or host a large model.

The training setup used a standard Hugging Face fine-tuning stack: transformers, peft, and TRL’s SFT flow. I used LoRA/QLoRA-style training so the experiment stayed lightweight. The first useful run trained for 2 epochs, finished in about 8 minutes, and reached a final eval loss around 0.1376 with token accuracy around 0.9496.

Evaluating structured output

Evaluation mattered because the output was not just text for a human to read. The app needed the model response to be machine-usable. So I evaluated the Actor model with checks that matched the runtime requirements.

The main questions were:

  • Does the response contain parseable JSON?
  • Does it include all required top-level fields?
  • Does it avoid extra or forbidden fields?
  • Is tool_request either null or a valid tool request?
  • Is the line short enough to be speakable in a puppet show?
  • Can the runtime sanitizer turn the output into something usable?

Here is the small eval summary I used while comparing the merged LoRA model and the final GGUF setup:

Check LoRA / merged model Final GGUF with llama.cpp
Eval prompts 40 40
Extractable JSON 35/40, 87.5% 39/40, 97.5%
Required fields present 34/40, 85.0% 39/40, 97.5%
Exact top-level schema 34/40, 85.0% 39/40, 97.5%
Sanitized Actor JSON usable 34/40, 85.0% 39/40, 97.5%
Strict tool_request valid 34/40, 85.0% 35/40, 87.5%
Sanitized tool_request usable 35/40, 87.5% 39/40, 97.5%

The first LoRA result was encouraging. It made the Actor much better at returning the expected JSON structure, but it was still not perfect. On the small 40-prompt eval set, the merged LoRA output produced extractable JSON in 35/40 cases and sanitized, usable Actor JSON in 34/40 cases.

The final GGUF setup did better after fixing the llama.cpp prompt and runtime flags. It produced extractable JSON in 39/40 cases and sanitized, usable Actor JSON in 39/40 cases. That made the local inference path feel much more realistic for the Actor role.

This was still usable because the app already had validation, repair, and fallback paths. The model did not need to be trusted blindly; it needed to be good enough to work with the runtime.

The v1 hardening attempt that got worse

After the first working version, I tried to create a harder dataset version to improve edge cases. The idea was reasonable: include stricter examples, more tool cases, and more schema pressure so the model would become even more reliable.

That attempt did not work as expected. The training loss looked fine, but the runtime behavior got worse. Some outputs drifted in small but important ways, especially around schema details and tool request shape. This was one of the most useful lessons from the project: a lower loss does not automatically mean the model is better for the app.

For this kind of fine-tuning, distribution matters a lot. If the training examples do not match the exact runtime contract, the model can learn something that looks close but still breaks the application. The eval script caught that, which made it easier to avoid shipping the worse version.

Merging to GGUF and testing with llama.cpp

After the LoRA adapter worked, I merged it into the base model and converted the result to GGUF for local inference. The goal was to test whether the Actor model could run in a more off-grid setup without relying only on hosted inference.

The GGUF path was useful, but it also exposed another layer of complexity. The first llama.cpp runs did not behave the way I expected. Some generations included extra text, repeated JSON objects, or added reasoning-style markers. The model was not necessarily the only problem; the prompt format and runtime mode mattered a lot.

The more reliable setup used a ChatML-style prompt, completion mode instead of chat mode, and flags to disable reasoning behavior. After that, the GGUF model became much more usable for the Actor role, which matched the improvement shown in the eval table above.

This was a good reminder that fine-tuning is only one part of deployment. The same model can behave very differently depending on the inference runtime and prompt wrapper.

What I learned

The main lesson was that small-model fine-tuning works best when the job is narrow and measurable. “Be a good puppet Actor” is vague, but “return one valid Actor JSON object for the next beat” is something I could train and evaluate.

I also learned that the boring parts are what make the model useful inside an app: dataset format, schema design, eval scripts, sanitizers, repair prompts, fallback behavior, and inference flags. The LoRA adapter was important, but it only worked well because the runtime around it expected failures and knew how to recover.

The next step would be to improve the dataset with more carefully designed examples instead of just making it stricter. I would also like to compare local LoRA, GGUF, and hosted inference more systematically inside the app.

Links and credits

Related posts and app:

Model and dataset artifacts:

Credits and thanks:

  • The Hugging Face and Gradio teams for organizing the Build Small Hackathon and providing the platform, credits, and motivation to build something small and complete.
  • OpenBMB for MiniCPM5-1B, which became the base model for the Actor fine-tuning experiment.
  • Modal for making it practical to run the LoRA/QLoRA training workflow on a CUDA GPU without setting up a separate training machine.
  • llama.cpp for the local GGUF inference path, which made it possible to test the merged Actor model in a more off-grid setup.

Community

Sign up or log in to comment