
It's 11 PM, the exam is tomorrow, and you're re-reading the same lecture PDF for the fourth time, feeling productive while learning nothing. Passive re-reading is one of the worst study techniques on record. Active recall, forcing yourself to answer questions, is one of the best.
So we built PaperProf: drop in any course PDF and it becomes your personal professor. It reads the material, generates exam-style questions from it, grades your answers like a patient tutor, and paints you a parting image when you finish. Everything runs on free infrastructure with zero external API calls. No OpenAI key, no rate limits, no data leaving the machine.
The team
PaperProf was built by two EPITA students who spent ten days arguing with Gradio so you don't have to.
What it does
Open questions
Write a free-form answer and get structured tutor feedback: a verdict, what you got right, what you missed, and a model answer.
MCQ
Four plausible options, instant client-side grading, and a one-sentence explanation for every choice, not just the right one.
Score ring
An animated SVG arc tracks your session in real time and shifts color with your accuracy.
Session image
End the session and FLUX.2-klein generates a unique image from the topics you just studied.
The whole loop runs on MiniCPM4.1-8B, our QLoRA fine-tune of openbmb's latest 8B model, loaded once and shared between question generation and answer evaluation. PyMuPDF extracts the text, a chunker splits it into thematic sections, and the model picks up from there.
What the git log actually says
A hackathon README tells you what was built. The git log tells you what happened. Ours has 101 commits and roughly two-thirds of them start with fix:. Here is the honest version.
Model choice is a compatibility problem, not a benchmark problem
We started with MiniCPM3-4B, upgraded to MiniCPM4-8B for better reasoning, and immediately hit the classic open-model trap: the model card says one thing, the transformers version on your machine says another.
The follow-up lesson came from quantization. Bitsandbytes 4-bit is great on a 16 GB local GPU and completely unnecessary on ZeroGPU hardware, so we made it conditional:
# HF Spaces (ZeroGPU): skip quantization, use bfloat16 directly
if os.environ.get("SPACE_ID"):
return None
# Locally: 4-bit when VRAM < 17 GB
Same code, two deployment targets, zero config files. Detect the environment, adapt.
The custom UI nearly broke us, and taught us the most
The hackathon has an Off-Brand badge: ship a UI that doesn't look like the framework you built it with. We wanted PaperProf to look like a real product. Glassmorphism, animated score ring, dark academia palette. Not a Gradio demo.
Restyle Gradio with CSS
Eleven consecutive commits of theme warfare. Gradio's theming always had one more !important than we did.
Nuke it from orbit: Docker + FastAPI
Raw HTML served by FastAPI, Gradio relegated to a backend. Worked locally, died on Spaces. ZeroGPU only flows through the Gradio SDK.
The hidden-component bridge
Keep Gradio as an invisible backend inside the page. A fully custom HTML/CSS/JS interface in gr.HTML, every real Gradio component hidden off-screen, and a 300 ms polling loop ferrying data between the two worlds.
This pattern produced the three hardest-won discoveries of the hackathon.
display: none silently kills Gradio. Components hidden that way never get their Svelte event handlers attached. The fix is the oldest trick in CSS:
/* collapsed but NOT display:none, so Gradio attaches handlers */
#hidden-row-question { height: 0 !important; overflow: visible !important; }
You can't .click() a Gradio button from JS. Server-side rendering means the synthetic click goes nowhere. What does work: setting a hidden textbox's value through the native property descriptor, then dispatching events so Svelte notices:
function setGradioTA(sel, val) {
const el = document.querySelector(sel);
Object.getOwnPropertyDescriptor(HTMLTextAreaElement.prototype, 'value')
.set.call(el, val);
el.dispatchEvent(new Event('input', {bubbles: true}));
el.dispatchEvent(new Event('change', {bubbles: true}));
}
Every action in PaperProf, from generating a question to submitting an answer, is a timestamp written into a hidden textbox, picked up by a .change() listener on the Python side. Buttons that aren't buttons.
Sometimes the dumb solution is the senior solution.
MutationObserver loses to Svelte. Gradio's reactive DOM updates don't always fire observers the way you'd expect. We surrendered and switched to a humble setInterval polling loop. Less elegant, infinitely more reliable.
ZeroGPU makes you think in seconds
ZeroGPU gives you a serious GPU for free, but only in short decorated windows. That budget reshapes your architecture:
60 to 90 seconds, be honest about it
Loading an 8B model takes a while the first time. The UI shows a live elapsed-time counter, escalating messages, and a 3-minute hard timeout that unlocks the UI instead of spinning forever.
Never download inside the GPU window
FLUX.2-klein weighs about 16 GB. We prefetch it in a daemon thread at startup, so the @spaces.GPU window is spent generating, not downloading.
Don't burn GPU on what JS can do
MCQ grading needs no model call. The LLM emits a structured format once, we parse it to JSON, and the browser grades clicks instantly. Zero latency, zero GPU seconds.
Skip what you never read
The FLUX repo ships a 7.75 GB duplicate ComfyUI checkpoint that diffusers never touches. One ignore pattern saved half the download.
The bug that fired twice
Late in the hackathon, our session-summary modal showed every MCQ answer duplicated: answer one question, see it counted twice, score 0/2.
The cause was textbook event handling. MCQ buttons had btn.onclick = handler assigned in the display function and an addEventListener registered by the global wiring function. One click, two handlers, two score increments. Our first fix removed the wrong one and clicks then did nothing at all. The final fix kept the onclick, reassigned fresh with each question and inherently idempotent, plus a re-entrancy guard.
When two pieces of code both helpfully wire the same button, you don't have redundancy. You have a race.
Prompts are product decisions
Small prompt details made the difference between tech demo and usable study tool. Early questions were rambling multi-part monsters. The fix was brutal constraint: "ONE question only, on ONE concept. Maximum 25 words. No sub-questions." The evaluator follows a fixed 4-part structure so the frontend can parse and render it as styled sections. Prompt format is API contract.
And with French source PDFs, the model kept drifting into French. Polite instructions lost to the gravitational pull of the context. What finally worked: IMPORTANT: Always write in English, stated twice, top and bottom of the prompt. With 8B models, subtlety is wasted. Repetition is a feature.
What we'd tell past us
- Read the git log of your own project.Two-thirds
fix:commits isn't failure. It's the actual texture of shipping, and each one was a lesson nobody had written down for us. - Frameworks fight back hardest at the edges.Using Gradio normally is easy. Using it as an invisible backend required understanding how it actually renders.
- Free infrastructure imposes honest engineering.No API credits to hide behind means caring about cold starts, GPU seconds, and weight prefetching. Constraints made the architecture better.
- Client-side everything you can.The MCQ mode is the snappiest feature in the app precisely because it never touches the server after generation.
- Ship the small thing.PaperProf does one loop, read, ask, grade, encourage, and does it end-to-end. A project that completes one circle beats one that sketches five.
The stack
| Layer | Choice |
|---|---|
| Q&A + evaluation | MiniCPM4.1-8B · QLoRA fine-tune (build-small-hackathon/MiniCPM4.1-8B-PaperProf) · bfloat16 · transformers 4.57.1 |
| Session images | FLUX.2-klein-4B (Black Forest Labs) · diffusers |
| PDF parsing | PyMuPDF |
| Backend / hosting | Gradio 6 on Hugging Face Spaces · ZeroGPU |
| Frontend | Hand-written HTML/CSS/JS over a hidden-Gradio bridge |
| External APIs | None. Fully off the grid. |
After the deadline: upgrading to MiniCPM4.1-8B
The hackathon ended. Then openbmb released MiniCPM4.1-8B — a new version with better reasoning and a built-in thinking mode. We upgraded.
Three things changed in the pipeline:
New base model
Swapped openbmb/MiniCPM4-8B for openbmb/MiniCPM4.1-8B. The new model has a thinking mode — chain-of-thought reasoning tokens that bloat structured outputs. We disable it: enable_thinking=False.
New fine-tune on the same data
Same QLoRA recipe (r=16, all-linear, 1 epoch), same 3 500 training pairs from SQuAD and SciQ in PaperProf's exact prompt format. Published at build-small-hackathon/MiniCPM4.1-8B-PaperProf.
New quantized runtime
The merged bf16 model is converted to Q4_K_M GGUF via llama.cpp and published at build-small-hackathon/MiniCPM4.1-8B-PaperProf-GGUF for the llama.cpp CPU runtime.
Agent trace on the Hub
12 live LLM calls across 3 sessions (OS, ML, Networking) — exact prompts, raw outputs, timings — published as a dataset at build-small-hackathon/PaperProf-traces for the community to learn from.
The upgrade took less than an hour of code changes. The fine-tune ran in ~20 minutes on a Modal A100-80GB. The lesson from the hackathon held: constraints make the architecture honest, and a well-structured pipeline makes iteration cheap.
