Spaces:
Running on Zero
A newer version of the Gradio SDK is available: 6.19.0
title: Recall — AI Study Partner
emoji: 📚
colorFrom: indigo
colorTo: green
sdk: gradio
sdk_version: 6.10.0
app_file: server.py
pinned: false
license: mit
tags:
- track:backyard
- sponsor:openbmb
- achievement:offgrid
- achievement:offbrand
📚 Recall — an AI study partner that gets smarter about what you get wrong
Upload your study material — typed notes, a PDF, even a photo or scan of a page → Recall generates a quiz deck → you answer → a small model grades and explains each answer → it generates new questions targeting exactly what you missed → end-of-session recap. Built for the Build Small Hackathon (Backyard AI track).
- Model: openbmb/MiniCPM-V-4.6 — multimodal (grades text and reads images/scans). Text-only fallbacks: MiniCPM4.1-8B, MiniCPM5-1B, MiniCPM3-4B.
- Platform: Gradio app, hosted as a Hugging Face Space
- Demo video: YouTube
- Social post: LinkedIn
Team
| Member | Hugging Face |
|---|---|
| Nikolai | @nz-nz |
| Frank | @francisco-magana |
| Arturo | @arturogp3 |
Run it (stub mode — no GPU, no model download)
pip install -r requirements.txt
python server.py # http://127.0.0.1:7860 ← polished custom frontend
Everything works end-to-end on canned data, so anyone can clone and click through the full loop in minute one.
server.py serves the Recall design (frontend/index.html) and a thin JSON
API over the existing backend — the learning/content logic and the schema.py
data contract are treated as an API and are never modified. It's built on
gradio.Server (a FastAPI subclass), so the same gradio-SDK Space that installs
gradio also runs the custom frontend; app.launch(prevent_thread_lock=True) binds
port 7860 directly while the main thread is held open. The original Gradio form is
still available standalone via python app.py.
Run with the real model
The heavy model deps (torch/transformers/…) are kept out of requirements.txt so
the Space build stays fast in stub mode. Install them with the model requirements:
pip install -r requirements-model.txt
RECALL_STUB=0 python server.py
Dependency pins (why gradio is 6.10.0). The binding constraint is the custom-frontend server: it uses
gradio.Server, and on gradio 6.17.x a customServerbreaks under a Space's runtime (app starts, process exits →RUNTIME_ERROR). gradio 6.10.0 is the version gradio's own ZeroGPUServerreference example ships and runs cleanly. It also resolves with the real model: MiniCPM-V 4.6 runs on transformers 5.x, which wants huggingface-hub 1.x, and 6.10.0 allowshuggingface-hub <2.0,>=0.33.5(i.e. hub 1.x). A gradio-SDK Space force-installs one gradio for the whole Space, so stub and real-model share it without a Docker Space — keeprequirements.txt,requirements-model.txtand the Spacesdk_versionin lockstep. The smaller text fallbacks add no extra constraint.
On Apple Silicon (M1/M2/…), the default bf16 + MPS combo produces garbage output (a known MPS bf16 instability — not present on the Space's CUDA GPU). For a clean local real-model smoke test, force CPU/float32:
RECALL_STUB=0 RECALL_MODEL=1b RECALL_DTYPE=float32 RECALL_DEVICE=cpu python server.py
The model
Recall runs on openbmb/MiniCPM-V-4.6, an open multimodal model from OpenBMB chosen for the Backyard AI track: small enough to serve on a single Hugging Face ZeroGPU Space, capable enough to grade free-text answers, write grounded follow-up questions, and read scanned or photographed material directly. One model does both the text and the vision work.
Where the model is load-bearing. Three user-visible features are pure model work, not templated strings:
- Grading — it compares your free-text answer to the reference answer and returns a 0–5 score, a plain-language explanation, and the specific concept you missed.
- Adaptive follow-ups — from that missed concept it writes brand-new questions that drill exactly what you got wrong.
- Vision / OCR — image-only or scanned PDFs that have no selectable text are rendered to images and read by the model directly to build the deck (
content_pipeline.py), so slide photos and scans work, not just digital text.
How inference is served. Everything model-related goes through a single chat(messages, max_tokens) wrapper in llm.py; no other module imports transformers directly. The model is loaded once, lazily, on the Space's ZeroGPU — the multimodal default via MiniCPMV4_6ForConditionalGeneration + an AutoProcessor, the text-only fallbacks via AutoModelForCausalLM + AutoTokenizer — in bf16 with device_map="auto", and the GPU entrypoint is wrapped in @spaces.GPU. max_tokens is kept tight (256–512) because latency is the demo-killer. Model output is never trusted: replies expected to be JSON are parsed defensively, with one repair retry and a safe fallback so a malformed generation can never crash the study loop.
Stub mode. With RECALL_STUB=1 (the default) chat() returns canned replies, so the whole app runs and demos end-to-end with no GPU and no model download. Flip RECALL_STUB=0 to use the real model.
Fallback (config flip, no code change). If the Space is too slow or runs out of memory, swap to a smaller model by setting RECALL_MODEL — the rest of the pipeline is unchanged (the text-only fallbacks drop the image/OCR path):
# text fallback (8B)
RECALL_MODEL=8b RECALL_STUB=0 python server.py # MiniCPM4.1-8B
# fast fallback
RECALL_MODEL=1b RECALL_STUB=0 python server.py # MiniCPM5-1B
# mid fallback — ≤4B, so it qualifies for the Tiny Titan prize
RECALL_MODEL=4b RECALL_STUB=0 python server.py # MiniCPM3-4B
Project layout
| File | Owner | What it is |
|---|---|---|
schema.py |
shared | The data contract (Card, CardState, GradeResult, Session). Don't change without a sync. |
llm.py |
Nikolai | Shared MiniCPM inference wrapper + defensive JSON parsing. |
learning_engine.py |
Nikolai | Scheduling (SM-2-lite), grading, adaptation, follow-ups, recap. |
content_pipeline.py |
Frank | Text & image PDFs → chunks (scans render to page images for the vision model) → question cards. |
app.py |
Arturo | Gradio UI (Upload / Study / Recap) over gr.State — standalone fallback (python app.py). |
server.py |
— | FastAPI server: serves the custom frontend + JSON API over the backend. |
frontend/index.html |
— | The polished Recall design (Upload / Study / Recap), vanilla HTML/CSS/JS. |
How to work in parallel
- At kickoff, lock
schema.pytogether. - Each module already ships working stubs — build your real logic behind the
same function signatures, flip
RECALL_STUB=0to test for real. - Don't change public function signatures without telling the team.
The judging hook
The small model is load-bearing in three visible places: grading free-text answers with explanations, generating follow-up questions that drill the exact concept you missed, and reading scanned/photographed material to build the deck. Make sure the demo shows them.