How We Fought Gradio and Won. PaperProf Field Notes

It's 11 PM, the exam is tomorrow, and you're re-reading the same lecture PDF for the fourth time, feeling productive while learning nothing. Passive re-reading is one of the worst study techniques on record. Active recall, forcing yourself to answer questions, is one of the best.

So we built PaperProf: drop in any course PDF and it becomes your personal professor. It reads the material, generates exam-style questions from it, grades your answers like a patient tutor, and paints you a parting image when you finish. Everything runs on free infrastructure with zero external API calls. No OpenAI key, no rate limits, no data leaving the machine.

Chapter 01

The team

PaperProf was built by two EPITA students who spent ten days arguing with Gradio so you don't have to.

Co-creator

Ryad Gazenay


Co-creator

Mehdi Azouz


Chapter 02

What it does

MODE A

Open questions

Write a free-form answer and get structured tutor feedback: a verdict, what you got right, what you missed, and a model answer.

MODE B

MCQ

Four plausible options, instant client-side grading, and a one-sentence explanation for every choice, not just the right one.

LIVE

Score ring

An animated SVG arc tracks your session in real time and shifts color with your accuracy.

REWARD

Session image

End the session and FLUX.2-klein generates a unique image from the topics you just studied.

The whole loop runs on MiniCPM4.1-8B, our QLoRA fine-tune of openbmb's latest 8B model, loaded once and shared between question generation and answer evaluation. PyMuPDF extracts the text, a chunker splits it into thematic sections, and the model picks up from there.
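The first two stages of that pipeline, extraction and chunking, could look roughly like the sketch below. It is not PaperProf's actual code: `extract_text` and `chunk_paragraphs` are hypothetical names, and a simple size-bounded paragraph chunker stands in for the thematic chunker, whose logic the post does not describe. PyMuPDF is imported lazily under its classic `fitz` module name.

```python
from typing import List


def extract_text(pdf_path: str) -> str:
    """Return the plain text of every page, separated by blank lines."""
    import fitz  # PyMuPDF; imported lazily so the chunker works without it

    with fitz.open(pdf_path) as doc:
        return "\n\n".join(page.get_text() for page in doc)


def chunk_paragraphs(text: str, max_chars: int = 2000) -> List[str]:
    """Greedily pack paragraphs into chunks of at most max_chars characters.

    Assumption: paragraphs are separated by blank lines. A single paragraph
    longer than max_chars is kept whole rather than split mid-sentence.
    """
    chunks: List[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        # +2 accounts for the blank line that joins two paragraphs.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be handed to the model as the context for one question. The 2000-character default is an arbitrary illustration, not a value from the project.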

Chapter 03

What the git log actually says

A hackathon README tells you what was built. The git log tells you what happened. Ours has 101 commits and roughly two-thirds of them start with fix:. Here is the honest version.


Model choice is a compatibility problem, not a benchmark problem

We started with MiniCPM3-4B, upgraded to MiniCPM4-8B for better reasoning, and immediately hit the classic open-model trap: the model card says one thing, the transformers version on your machine says another.

fix: pin transformers==4.57.1 for MiniCPM4-8B compatibility

The follow-up lesson came from quantization. Bitsandbytes 4-bit is great on a 16 GB local GPU and completely unnecessary on ZeroGPU hardware, so we made it conditional:

# HF Spaces (ZeroGPU): skip quantization, use bfloat16 directly
if os.environ.get("SPACE_ID"):
    return None
# Locally: 4-bit when VRAM < 17 GB

Same code, two deployment targets, zero config files. Detect the environment, adapt.
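Here is a minimal sketch of that decision as a standalone function. The name `choose_precision` and its return values are hypothetical. What happens locally when VRAM is 17 GB or more, or when it is unknown, is an assumption: the sketch falls back to bfloat16.

```python
import os
from typing import Mapping, Optional

VRAM_THRESHOLD_GB = 17.0  # from the snippet above: 4-bit when VRAM < 17 GB


def choose_precision(
    vram_gb: Optional[float],
    env: Mapping[str, str] = os.environ,
) -> str:
    """Return "bf16" or "4bit" depending on where the app is running."""
    # Hugging Face Spaces sets SPACE_ID; on ZeroGPU we skip quantization.
    if env.get("SPACE_ID"):
        return "bf16"
    # Locally, quantize only when the GPU is too small for bfloat16 weights.
    if vram_gb is not None and vram_gb < VRAM_THRESHOLD_GB:
        return "4bit"
    # Assumption: large or unknown VRAM falls back to unquantized bfloat16.
    return "bf16"
```

A caller would map "4bit" to a bitsandbytes quantization config and "bf16" to no config at all, which is what the `return None` in the original snippet expresses.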

The custom UI nearly broke us, and taught us the most

The hackathon has an Off-Brand badge: ship a UI that doesn't look like the framework you built it with. We wanted PaperProf to look like a real product. Glassmorphism, animated score ring, dark academia palette. Not a Gradio demo.

Attempt 01 / Failed

Restyle Gradio with CSS

Eleven consecutive commits of theme warfare. Gradio's theming always had one more !important than we did.

Attempt 02 / Failed

Nuke it from orbit: Docker + FastAPI

Raw HTML served by FastAPI, Gradio relegated to a backend. Worked locally, died on Spaces: ZeroGPU is only available through the Gradio SDK, not to a Docker Space.

Attempt 03 / Shipped

The hidden-component bridge

Keep Gradio as an invisible backend inside the page. A fully custom HTML/CSS/JS interface in gr.HTML, every real Gradio component hidden off-screen, and a 300 ms polling loop ferrying data between the two worlds.

This pattern produced the three hardest-won discoveries of the hackathon.

display: none silently kills Gradio. Components hidden that way never get their Svelte event handlers attached. The fix is the oldest trick in CSS:

/* collapsed but NOT display:none, so Gradio attaches handlers */
#hidden-row-question { height: 0 !important; overflow: visible !important; }

You can't .click() a Gradio button from JS. Server-side rendering means the synthetic click goes nowhere. What does work: setting a hidden textbox's value through the native property descriptor, then dispatching events so Svelte notices:

function setGradioTA(sel, val) {
  const el = document.querySelector(sel);
  Object.getOwnPropertyDescriptor(HTMLTextAreaElement.prototype, 'value')
        .set.call(el, val);
  el.dispatchEvent(new Event('input',  {bubbles: true}));
  el.dispatchEvent(new Event('change', {bubbles: true}));
}

Every action in PaperProf, from generating a question to submitting an answer, is a timestamp written into a hidden textbox, picked up by a .change() listener on the Python side. Buttons that aren't buttons.
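The Python half of that bridge might look like the sketch below. It is an illustration under assumptions, not the project's code: the element ids `trigger-generate` and `question-out`, the helper names, and the placeholder question are all hypothetical. Gradio is imported lazily inside `build_demo` so the handler can be exercised on its own.

```python
from typing import Callable


def handle_trigger(stamp: str, action: Callable[[], str]) -> str:
    """Run `action` only when the hidden textbox holds a timestamp."""
    if not stamp or not stamp.strip():
        return ""  # initial empty value: nothing was requested
    return action()


def make_question() -> str:
    # Placeholder standing in for the real model call (hypothetical).
    return "Placeholder question"


def build_demo():
    """Wire the trigger textbox to the handler. Requires gradio."""
    import gradio as gr

    with gr.Blocks() as demo:
        # Both components stay rendered; custom CSS collapses them instead
        # of hiding them, so the browser can still write to the textarea.
        trigger = gr.Textbox(elem_id="trigger-generate", label="trigger")
        question = gr.Textbox(elem_id="question-out", label="question")
        # Fires when the front end writes a new timestamp into the box.
        trigger.change(
            fn=lambda stamp: handle_trigger(stamp, make_question),
            inputs=trigger,
            outputs=question,
        )
    return demo
```

Writing a timestamp rather than a constant matters: every write changes the value, so the `.change()` listener fires on every action, including repeated ones.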

Sometimes the dumb solution is the senior solution.

MutationObserver loses to Svelte. Gradio's reactive DOM updates don't always fire observers the way you'd expect. We surrendered and switched to a humble setInterval polling loop. Less elegant, infinitely more reliable.

ZeroGPU makes you think in seconds

ZeroGPU gives you a serious GPU for free, but only in short windows, each one opened by a function decorated with @spaces.GPU. That budget reshapes your architecture:

COLD START

60 to 90 seconds, be honest about it

Loading an 8B model takes a while the first time. The UI shows a live elapsed-time counter, escalating messages, and a 3-minute hard timeout that unlocks the UI instead of spinning forever.

PREFETCH

Never download inside the GPU window

FLUX.2-klein weighs about 16 GB. We prefetch it in a daemon thread at startup, so the @spaces.GPU window is spent generating, not downloading.

CLIENT-SIDE

Don't burn GPU on what JS can do

MCQ grading needs no model call. The LLM emits a structured format once, we parse it to JSON, and the browser grades clicks instantly. Zero latency, zero GPU seconds.
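The post does not show the structured format, so the sketch below invents one purely for illustration: a `Q:` line, four `A)` to `D)` lines with the explanation after a `|`, and an `ANSWER:` line. The function names are hypothetical too. The point is the shape of the step: parse once on the server, ship JSON, let the browser grade.

```python
import json
import re

# Hypothetical line format: "B) option text | one-sentence explanation"
OPTION_RE = re.compile(r"^([A-D])\)\s*(.+?)\s*\|\s*(.+)$")


def parse_mcq(raw: str) -> dict:
    """Parse the (assumed) model output into a JSON-serializable dict."""
    question = None
    answer = None
    options = []
    for line in raw.splitlines():
        line = line.strip()
        if line.startswith("Q:"):
            question = line[2:].strip()
        elif line.startswith("ANSWER:"):
            answer = line[len("ANSWER:"):].strip().upper()
        else:
            match = OPTION_RE.match(line)
            if match:
                label, text, why = match.groups()
                options.append({"label": label, "text": text, "why": why})
    labels = [option["label"] for option in options]
    # Reject anything that is not exactly one question with options A to D.
    if question is None or labels != ["A", "B", "C", "D"] or answer not in labels:
        raise ValueError("malformed MCQ output")
    return {"question": question, "options": options, "answer": answer}


def mcq_to_json(raw: str) -> str:
    """Serialize the parsed question for the browser."""
    return json.dumps(parse_mcq(raw))
```

Failing loudly on malformed output lets the server regenerate the question instead of sending the browser something it cannot grade.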

TRIM

Skip what you never read

The FLUX repo ships a 7.75 GB duplicate ComfyUI checkpoint that diffusers never touches. One ignore pattern saved half the download.
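The PREFETCH and TRIM ideas combine naturally in a few lines. This is a sketch under assumptions: the repo id is inferred from the model name in the stack table, the ignore pattern is a placeholder because the post does not name the duplicate file, and `start_prefetch` and `download_flux` are hypothetical names. `snapshot_download` and its `ignore_patterns` argument come from `huggingface_hub`, imported lazily here.

```python
import threading
from typing import Callable

# Assumptions: repo id inferred from the model name; pattern is a placeholder.
FLUX_REPO = "black-forest-labs/FLUX.2-klein-4B"
IGNORE_PATTERNS = ["*comfyui*"]  # hypothetical; match the real duplicate file


def download_flux() -> str:
    """Download the weights into the local Hugging Face cache."""
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=FLUX_REPO, ignore_patterns=IGNORE_PATTERNS)


def start_prefetch(download: Callable[[], object] = download_flux) -> threading.Thread:
    """Start the download in a daemon thread and return immediately."""
    thread = threading.Thread(target=download, name="flux-prefetch", daemon=True)
    thread.start()
    return thread
```

Calling `start_prefetch()` once at app startup is enough. Because the thread is a daemon, it never blocks shutdown, and a later load inside the GPU window finds the files already in the cache if the download has finished.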

The bug that fired twice

Late in the hackathon, our session-summary modal showed every MCQ answer duplicated: answer one question, see it counted twice, score 0/2.

The cause was textbook event handling. MCQ buttons had btn.onclick = handler assigned in the display function and an addEventListener registered by the global wiring function. One click, two handlers, two score increments. Our first fix removed the wrong one, and clicks then did nothing at all. The final fix kept only the onclick assignment, which is reassigned fresh with each question and is therefore idempotent, and added a re-entrancy guard.

When two pieces of code both helpfully wire the same button, you don't have redundancy. You have a race.

Prompts are product decisions

Small prompt details made the difference between tech demo and usable study tool. Early questions were rambling multi-part monsters. The fix was brutal constraint: "ONE question only, on ONE concept. Maximum 25 words. No sub-questions." The evaluator follows a fixed 4-part structure so the frontend can parse and render it as styled sections. Prompt format is API contract.
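To make "prompt format is API contract" concrete, here is a sketch of a parser for a four-part evaluation. The section labels are hypothetical, chosen to mirror the verdict, what you got right, what you missed, and model answer described earlier. In PaperProf the parsing happens in the front end, so treat this Python version as an illustration of the contract rather than the real code.

```python
import re
from typing import Dict, List, Optional

# Hypothetical labels; the real prompt's section names are not shown.
SECTIONS = ("VERDICT", "CORRECT", "MISSED", "MODEL ANSWER")
_HEADER_RE = re.compile(
    r"^(%s):\s*(.*)$" % "|".join(re.escape(name) for name in SECTIONS)
)


def parse_feedback(raw: str) -> Dict[str, str]:
    """Split evaluator output into its four labeled sections."""
    parts: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in raw.splitlines():
        match = _HEADER_RE.match(line.strip())
        if match:
            # A header line opens a new section.
            current = match.group(1)
            parts[current] = [match.group(2)]
        elif current is not None:
            # Any other line continues the section opened last.
            parts[current].append(line.strip())
    missing = [name for name in SECTIONS if name not in parts]
    if missing:
        raise ValueError(f"missing sections: {missing}")
    return {name: "\n".join(lines).strip() for name, lines in parts.items()}
```

If the model drifts from the format, the parser raises instead of rendering a half-empty card, which is the practical meaning of treating the prompt as a contract.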

And with French source PDFs, the model kept drifting into French. Polite instructions lost to the gravitational pull of the context. What finally worked: IMPORTANT: Always write in English, stated twice, top and bottom of the prompt. With 8B models, subtlety is wasted. Repetition is a feature.
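A sketch of the "state it twice" trick, assuming a simple string-assembled prompt. The two rule strings are taken from the quotes above; the function name and the "Course material:" label are hypothetical, and the real prompt is certainly longer.

```python
LANGUAGE_RULE = "IMPORTANT: Always write in English."
QUESTION_RULE = (
    "ONE question only, on ONE concept. Maximum 25 words. No sub-questions."
)


def build_question_prompt(chunk: str) -> str:
    """Sandwich the source text between two copies of the language rule."""
    return "\n\n".join(
        [
            LANGUAGE_RULE,  # first statement, before the model sees any French
            QUESTION_RULE,
            "Course material:\n" + chunk.strip(),
            LANGUAGE_RULE,  # repeated last, closest to where generation starts
        ]
    )
```

The second copy is the one that matters most: it sits after the French context, so it is the last instruction the model reads before it starts writing.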

Chapter 04

What we'd tell past us

Read the git log of your own project. Two-thirds fix: commits isn't failure. It's the actual texture of shipping, and each one was a lesson nobody had written down for us.
Frameworks fight back hardest at the edges. Using Gradio normally is easy. Using it as an invisible backend required understanding how it actually renders.
Free infrastructure imposes honest engineering. No API credits to hide behind means caring about cold starts, GPU seconds, and weight prefetching. Constraints made the architecture better.
Client-side everything you can. The MCQ mode is the snappiest feature in the app precisely because it never touches the server after generation.
Ship the small thing. PaperProf does one loop, read, ask, grade, encourage, and does it end-to-end. A project that completes one circle beats one that sketches five.

Chapter 05

The stack

Layer	Choice
Q&A + evaluation	MiniCPM4.1-8B · QLoRA fine-tune (build-small-hackathon/MiniCPM4.1-8B-PaperProf) · bfloat16 · transformers 4.57.1
Session images	FLUX.2-klein-4B (Black Forest Labs) · diffusers
PDF parsing	PyMuPDF
Backend / hosting	Gradio 6 on Hugging Face Spaces · ZeroGPU
Frontend	Hand-written HTML/CSS/JS over a hidden-Gradio bridge
External APIs	None. Fully off the grid.

Chapter 06

After the deadline: upgrading to MiniCPM4.1-8B

The hackathon ended. Then openbmb released MiniCPM4.1-8B, a new version with better reasoning and a built-in thinking mode. We upgraded.

Three things changed in the pipeline, and one new artifact went up on the Hub:

UPGRADE

New base model

Swapped openbmb/MiniCPM4-8B for openbmb/MiniCPM4.1-8B. The new model has a thinking mode that emits chain-of-thought reasoning tokens, which bloat structured outputs. We disable it with enable_thinking=False.
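Where enable_thinking=False goes is not spelled out above. A common place for such a flag, and the assumption made in this sketch, is the tokenizer's `apply_chat_template` call, which forwards extra keyword arguments to the model's chat template. `render_prompt` is a hypothetical helper name.

```python
from typing import Any, Dict, List


def render_prompt(tokenizer: Any, messages: List[Dict[str, str]]) -> str:
    """Render chat messages to a prompt string with thinking mode off."""
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,              # return text, not token ids
        add_generation_prompt=True,  # end with the assistant turn marker
        enable_thinking=False,       # assumption: read by the chat template
    )
```

With a real tokenizer loaded from the model repo, the returned string is what gets tokenized and passed to generation. If the template ignored the flag, the call would still succeed but reasoning tokens would come back, so it is worth checking one raw output after the switch.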

RE-TUNE

New fine-tune on the same data

Same QLoRA recipe (r=16, all-linear, 1 epoch), same 3,500 training pairs from SQuAD and SciQ in PaperProf's exact prompt format. Published at build-small-hackathon/MiniCPM4.1-8B-PaperProf.

GGUF

New quantized runtime

The merged bf16 model is converted to Q4_K_M GGUF via llama.cpp and published at build-small-hackathon/MiniCPM4.1-8B-PaperProf-GGUF for the llama.cpp CPU runtime.

TRACE

Agent trace on the Hub

12 live LLM calls across 3 sessions (OS, ML, Networking), with exact prompts, raw outputs, and timings, published as a dataset at build-small-hackathon/PaperProf-traces for the community to learn from.

The upgrade took less than an hour of code changes. The fine-tune ran in ~20 minutes on a Modal A100-80GB. The lesson from the hackathon held: constraints make the architecture honest, and a well-structured pipeline makes iteration cheap.

How we fought Gradio, won, and shipped in 10 days.