Field Notes: Perfector(Post Audit) — a brief-aware social-post auditor in 4.5B params

Community Article
Published June 14, 2026

Summary

Post Audit checks a social-media draft against the goal and audience you wrote it for — before you publish. You paste the platform, your goal ("register attendees for Thursday's 7pm webinar"), your audience ("product managers who own metrics"), and the draft. You get back a structured readiness report: a brief check, five scored dimensions, a catalogue of warnings, and concrete rewrite hints in the post's own language.

Under the hood it's a deliberately hybrid pipeline. Deterministic rule linters run on the Gradio Space and catch the mechanical stuff (hashtag stuffing, chat-dump formatting, bare links, missing deadlines). A small LLM — Gemma 4 E4B (4.5B effective parameters, well under the 32B hackathon limit) — handles the judgment that rules can't: does the post actually serve its stated goal, is the tone right for the audience, is the call-to-action clear. The host then recomputes the overall score and critical caps itself, so the final number never depends on the model doing arithmetic correctly.

The model is served as a quantized GGUF via Ollama / llama.cpp — the same model and runner locally and in production (on a Modal L4 GPU). Local dev and prod behave identically.

How this maps to the hackathon

Post Audit is built for the Build Small Hackathon and submitted to the Backyard AI track — solving a real, recurring problem for people we actually know (see Who it's for below). Beyond the track, it lines up with several bonus paths:

  • Backyard AI (track) — a pre-publish QA tool for real people growing real audiences, not a demo for nobody.
  • Llama Champion — the model runs on the llama.cpp runtime (served as a quantized GGUF through Ollama), the same stack in dev and prod.
  • Modal Awards — production inference runs on a Modal L4 GPU endpoint, with the model cached in a Modal Volume to keep cold starts manageable.
  • Field Notes — this write-up.
  • Off-Brand — the UI is not stock Gradio: the entire report is bespoke HTML rendered by the app (render.py) over a custom theme and design system (Space Grotesk / IBM Plex), built around a "readiness-panel" identity rather than default components. (Note: the app is a heavily themed gr.Blocks app with a custom-rendered report, not a separate gr.Server frontend — we're claiming the custom-UI spirit of the badge, not a standalone server.)

Links

The problem

Community managers, team leads, and anyone building a personal brand write posts with a goal in mind — and then publish drafts that quietly miss it. The deadline is buried three lines down. The lede is an apology. The post explains a thing beautifully but never asks the reader to do the thing. Manual review catches some of this, but inconsistently and slowly, and you're the worst reviewer of your own draft.

Post Audit narrows the question. It doesn't try to rewrite your post for you or generate content from scratch. It asks one thing: does this draft serve the goal and audience you stated? — and shows its reasoning as discrete, explainable flags you can act on.

Who it's for (real backyard users)

This is the Backyard AI part — concrete people, not personas:

  • A cardiologist building a personal medical brand on Telegram and Instagram. Patient-education posts have to land a clear message and a clear next step for a lay audience — easy to get wrong when you're an expert writing from inside the jargon. Post Audit flags audience-fit and CTA problems before a post goes out to patients.
  • Pavel Trubin (co-author), running an emerging blog. Every early post is a chance to set the tone and reader expectation. He audits each draft against its goal and target reader before publishing — the difference between a blog that compounds and one that drifts.

The decomposed, transparent architecture

Gradio Space (HF)          Modal (L4 GPU)
     │                           │
     ├─ rules.py (sync)          └─ Ollama · gemma4:e4b (GGUF)
     ├─ merge.py (scores)
     └─ render.py (report UI)

Three layers, each doing what it's good at:

  1. Rule linters (rules.py, on the Space). Fast, deterministic, model-free. Hashtag density, chat-dump structure, bare links, missing deadlines, and friends. These never hallucinate and never cost a GPU second.
  2. LLM judgment (Gemma 4 E4B, on Modal). Goal alignment, hook strength, audience fit, CTA clarity, plus rewrite hints — the genuinely subjective calls, returned as constrained JSON.
  3. Host-side merge and recompute (merge.py). The five dimensions — hook, clarity, audienceFit, goalService, cta — are averaged and normalized to a 0–100 overall by the host. If a critical warning fires, the host caps overall itself. The model's arithmetic is never trusted — it scores dimensions and raises warnings; the host owns the math.

The payoff is trust: every number on screen traces back to either a rule or a model judgment, and the final verdict is computed by code you can read.

Small model, same stack local and prod

A constrained, well-scoped judgment task doesn't need a frontier model. Gemma 4 E4B (4.5B effective) is enough to assess whether a draft serves its goal — and it's small enough to run on a single L4. Because we serve the same quantized GGUF through Ollama/llama.cpp in development and on Modal, there's no "works on my machine" gap: you can iterate against the real production model locally, with no cloud credits, then deploy the identical stack.

What I learned

  • Decompose for trust. Splitting deterministic rules from LLM judgment, and keeping score arithmetic on the host, made the output explainable — and made the LLM's job small enough for a tiny model.
  • Never trust the model's math. Letting the LLM score dimensions but recomputing the overall and the caps in code removed a whole class of inconsistent results.
  • Tiny models are plenty for narrow tasks. A 4.5B model, given a tight prompt and a clear schema, does goal/audience judgment well. Match the model to the task, not to the leaderboard.

Try it

Open the Space, load one of the built-in examples (a weak webinar CTA, or a chat-dump handoff), and run an audit. First run after idle can take ~2 minutes while the GPU loads the model; later runs are quick.

Post Audit is a drafting aid, not an oracle — it tells you where a draft drifts from its stated goal so you can decide what to do about it.

Authors

Community

Sign up or log in to comment