# Implementation Plan: "Cook With Me"

Step-by-step implementation guide for developers building the multimodal cooking sous-chef Gradio app for Hugging Face Spaces.

Hackathon: Small models / Big adventures, June 2026

Read first: `plan.md` (the what and why) and `estrategia.md` (the how at a strategic level). This document is the how at a tactical level: turn this into code.

## 0. Locked decisions (do not re-discuss)
| Decision | Value | Reason |
|---|---|---|
| UI framework | Gradio | Hackathon requirement |
| Hosting | Hugging Face Space | Hackathon requirement |
| Inference runtime (text + vision) | llama.cpp via `llama-cpp-python` | Runs inside the Space CPU, no external APIs needed for now. Future: migrate to Modal |
| Image generation | FLUX.2 Klein 9B (`black-forest-labs/FLUX.2-klein-9B`) | Sponsor model; runs in the Space if a GPU Space is rented (or via `enable_model_cpu_offload()` as fallback). Plan to migrate this specific component to Modal post-hackathon |
| Recipe planner / reasoning | `openbmb/MiniCPM-V-4` (GGUF) | Provided requirement |
| Vision (ingredient ID + progress validator) | `openbmb/MiniCPM-V-4.6` (GGUF) | Provided requirement |
| Text-to-speech | OpenBMB VoxCPM2 | Provided requirement |
| Recipe dataset | `thedevastator/better-recipes-for-a-better-life` (Kaggle), international cuisine | Provided requirement; not limited to Mexican food |
| App language | English only | Provided requirement |
| Final output | Recipe + step images + voice + nutritional values | Provided requirement |
| External API calls at runtime | None | "llama.cpp inside the Space" mandate |

## 1. Architecture (final, English-only, llama.cpp-first)

```text
                       ┌────────────────────────────────────────┐
                       │      Hugging Face Space (Gradio)       │
                       │      (CPU + optional GPU upgrade)      │
                       ├────────────────────────────────────────┤
📸 Fridge photo ──────▶│ [Vision Agent]                         │
                       │ MiniCPM-V-4.6 GGUF (llama.cpp)         │
                       │ → list[ingredient]                     │
                       │         │                              │
                       │         ▼                              │
🥘 User picks dish ───▶│ [Recipe Planner]                       │
                       │ MiniCPM-V-4 GGUF (llama.cpp)           │
                       │ + retrieval over Kaggle dataset        │
                       │ → Recipe JSON (steps, nutrition)       │
                       │         │                              │
                       │         ▼                              │
                       │ [Step Illustrator]                     │
                       │ FLUX.2 Klein 9B (diffusers)            │
                       │ → PNG per step + final dish            │
                       │         │                              │
                       │         ▼                              │
                       │ [Narrator]                             │
                       │ VoxCPM2 → MP3 per step                 │
                       │         │                              │
                       │         ▼                              │
📸 Progress photo ────▶│ [Progress Validator]                   │
                       │ MiniCPM-V-4.6 (vision compare)         │
                       │ → "go / wait / fix" + tip              │
                       └────────────────────────────────────────┘
```
Total parameter count (≤ 32B requirement):
- MiniCPM-V-4 (reasoning) ≈ 4B
- MiniCPM-V-4.6 (vision) ≈ 4.6B
- FLUX.2 Klein ≈ 9B
- VoxCPM2 ≈ 1B (estimate)
- Total ≈ 18.6B ✓
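
The tally above is simple enough to keep as an executable check, so the README figure cannot drift when a model is swapped. The sketch below is illustrative only: the dictionary and function names are hypothetical, and the VoxCPM2 figure is the plan's own estimate, not a measured value.

```python
# Parameter counts in billions, copied from the list above.
# The VoxCPM2 figure is the plan's estimate, not a measured value.
PARAMS_B = {
    "MiniCPM-V-4 (reasoning)": 4.0,
    "MiniCPM-V-4.6 (vision)": 4.6,
    "FLUX.2 Klein": 9.0,
    "VoxCPM2 (estimate)": 1.0,
}
BUDGET_B = 32.0  # hackathon ceiling


def total_params_b(params: dict[str, float]) -> float:
    """Sum the per-model parameter counts (billions)."""
    return round(sum(params.values()), 1)


def headroom_b(params: dict[str, float], budget: float = BUDGET_B) -> float:
    """Remaining budget after all models are counted."""
    return round(budget - total_params_b(params), 1)


if __name__ == "__main__":
    print(f"total = {total_params_b(PARAMS_B)}B, headroom = {headroom_b(PARAMS_B)}B")
```

Re-run it whenever a model changes (for example after confirming the real VoxCPM2 size in Phase 0).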

## 2. Repository layout

```text
cook-with-me/
├── app.py                        # Gradio entrypoint (Space looks for this)
├── requirements.txt
├── packages.txt                  # apt packages (ffmpeg, libsndfile1)
├── README.md                     # Space card (HF requires YAML frontmatter)
├── .gitignore
├── src/
│   ├── __init__.py
│   ├── config.py                 # paths, model IDs, constants
│   ├── models/
│   │   ├── __init__.py
│   │   ├── vision.py             # MiniCPM-V-4.6 wrapper (llama-cpp)
│   │   ├── planner.py            # MiniCPM-V-4 wrapper (llama-cpp)
│   │   ├── illustrator.py        # FLUX.2 Klein wrapper (diffusers)
│   │   ├── narrator.py           # VoxCPM2 wrapper
│   │   └── loader.py             # lazy singletons + GGUF download
│   ├── agents/
│   │   ├── mise_en_place.py      # ingredient identification
│   │   ├── recipe_planner.py     # builds Recipe object
│   │   ├── step_illustrator.py   # per-step image gen
│   │   ├── narrator.py           # per-step TTS
│   │   └── progress_validator.py
│   ├── data/
│   │   ├── recipe_index.py       # loads Kaggle dataset, builds retrieval
│   │   └── nutrition.py          # USDA-style nutrition computation
│   ├── pipeline.py               # Recipe state machine, orchestration
│   ├── prompts/
│   │   ├── vision_prompt.txt
│   │   ├── planner_system.txt
│   │   └── validator_prompt.txt
│   └── ui/
│       ├── theme.py              # custom CSS (Off-Brand badge)
│       └── components.py         # reusable Gradio Blocks pieces
├── scripts/
│   ├── download_models.py        # pre-warms GGUF + Flux weights at build time
│   ├── build_recipe_index.py     # caches Kaggle dataset locally
│   └── smoke_test.py             # end-to-end validation before push
└── assets/
    ├── sample_fridge_1.jpg
    └── sample_progress_1.jpg
```

## 3. Phase-by-phase plan (10 days)

Each phase has: goal, tasks, deliverable, verification check. Do not move to the next phase if verification fails.

### Phase 0 - Day 0 (½ day): Account + tooling setup
Goal: every credential and CLI is ready before writing code.
Tasks
- Create or confirm Hugging Face account; generate a write token (Settings → Access Tokens). Store as `HF_TOKEN` env var locally.
- Install Hugging Face CLI: `pip install -U huggingface_hub` then `huggingface-cli login`.
- Install Kaggle CLI: `pip install kaggle`. Place `kaggle.json` (Account → API → Create New Token) in `~/.kaggle/kaggle.json` with `chmod 600`.
- Install OpenAI Codex CLI (pair-programmer) and verify your $100 credit is active.
- Create a local Python 3.11 venv: `python -m venv .venv && source .venv/bin/activate`.
- Create the repo locally: `git init cook-with-me && cd cook-with-me`.
- Create an empty Hugging Face Space: huggingface.co → New Space → SDK = Gradio, Hardware = CPU basic (upgrade later if you need GPU for FLUX). Clone it and copy your repo skeleton into it.
- Verify model availability: open in a browser and confirm pages exist:
  - `huggingface.co/openbmb/MiniCPM-V-4`
  - `huggingface.co/openbmb/MiniCPM-V-4-6`
  - `huggingface.co/openbmb/VoxCPM2` (or whatever the exact repo name is; search "VoxCPM" on HF)
  - `huggingface.co/black-forest-labs/FLUX.2-klein-9B`

Deliverable: empty Space deployed showing "Hello World" Gradio.
Verify: https://huggingface.co/spaces/<you>/cook-with-me loads.

### Phase 1 - Day 1: Project skeleton + recipe dataset ingestion
Goal: the Kaggle dataset is downloaded, parsed, and cached as a local artifact ready for retrieval.
Tasks
- Write `requirements.txt` (initial version; packages will be added as phases progress):
```text
gradio>=4.44
huggingface_hub>=0.24
llama-cpp-python>=0.3.2
numpy
pandas
Pillow
pydantic>=2
sentence-transformers
```
- Write `packages.txt`:
```text
ffmpeg
libsndfile1
```
- Write `scripts/build_recipe_index.py`:
  - Use `kagglehub.load_dataset(KaggleDatasetAdapter.PANDAS, "thedevastator/better-recipes-for-a-better-life", file_path)`; discover `file_path` by listing the dataset files first via `kagglehub.dataset_download`.
  - Normalize columns: `name`, `ingredients` (list[str]), `instructions` (list[str]), `cuisine` (str if present, else "international"), `prep_time`, `servings`.
  - Drop rows missing critical fields. Lowercase + strip ingredient strings.
  - Save to `data/recipes.parquet` (~5-50 MB depending on dataset size).
  - Build sentence embeddings of the recipe name + first 3 ingredients using `sentence-transformers/all-MiniLM-L6-v2` and save to `data/recipes_emb.npy`.
  - This script runs once locally; commit the parquet + npy files to the repo (or to a private HF Dataset, then download in `app.py`). If files exceed 100 MB, push to a HF Dataset repo: `<you>/cook-with-me-recipes`.
- Write `src/data/recipe_index.py`:
  - `class RecipeIndex` with `.search(ingredients: list[str], top_k=5) -> list[RecipeRow]`.
  - Build a query string from ingredients, embed it, cosine-similarity against the cached embeddings, return top-k.

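
The retrieval step in `RecipeIndex.search` reduces to a cosine-similarity top-k over the cached embedding matrix. The sketch below shows only that ranking step, under the assumption that the query has already been embedded; the function name `top_k_cosine` and the toy 3-dimensional vectors are hypothetical stand-ins for the MiniLM embeddings.

```python
import numpy as np


def top_k_cosine(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 5) -> list[int]:
    """Return indices of the k rows of doc_matrix most similar to query_vec."""
    # Normalize so that a dot product equals cosine similarity.
    q = query_vec / (np.linalg.norm(query_vec) + 1e-12)
    docs = doc_matrix / (np.linalg.norm(doc_matrix, axis=1, keepdims=True) + 1e-12)
    scores = docs @ q
    k = min(k, len(scores))
    # argsort is ascending, so take the tail and reverse it.
    return np.argsort(scores)[-k:][::-1].tolist()


if __name__ == "__main__":
    # Toy 3-dimensional "embeddings" stand in for the real recipe vectors.
    docs = np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]])
    print(top_k_cosine(np.array([1.0, 0.0, 0.0]), docs, k=2))
```

The returned indices can then be used to look up rows in the parquet file.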
Deliverable: `python -c "from src.data.recipe_index import RecipeIndex; r=RecipeIndex(); print(r.search(['chicken','onion','tomato']))"` prints 5 sensible recipes.

Verify: at least 3 of the top-5 results contain ≥2 of the input ingredients.

### Phase 2 - Day 2: Vision agent (Mise en Place), MiniCPM-V-4.6 via llama.cpp
Goal: given a fridge photo, return a clean list of English ingredient names.
Background: llama.cpp supports multimodal models through a vision projector (mmproj-*.gguf) plus the language model GGUF. MiniCPM-V family ships both files on the Hub.
Tasks
- Find the GGUF release of MiniCPM-V-4.6. Search HF for `MiniCPM-V-4_6-gguf` or `openbmb/MiniCPM-V-4_6-gguf`. You need two files:
  - `Model-Q4_K_M.gguf` (or similar quant)
  - `mmproj-model-f16.gguf` (the vision projector)
- Write `src/models/loader.py`:
```python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama
from llama_cpp.llama_chat_format import MiniCPMv26ChatHandler  # or matching handler

VISION_REPO = "openbmb/MiniCPM-V-4_6-gguf"  # confirm exact repo

_vision = None  # lazy singleton, loaded on first use

def get_vision_model():
    global _vision
    if _vision is None:
        model_path = hf_hub_download(repo_id=VISION_REPO, filename="Model-Q4_K_M.gguf")
        mmproj_path = hf_hub_download(repo_id=VISION_REPO, filename="mmproj-model-f16.gguf")
        handler = MiniCPMv26ChatHandler(clip_model_path=mmproj_path)
        _vision = Llama(
            model_path=model_path,
            chat_handler=handler,
            n_ctx=4096,
            n_threads=4,
            verbose=False,
        )
    return _vision
```
- Write `src/agents/mise_en_place.py`:
```python
import base64
import io
import json

from PIL import Image

from src.models.loader import get_vision_model

PROMPT = (
    "You are an ingredient detector. Look at the fridge/pantry photo and "
    "list every edible ingredient you can identify. Return strict JSON: "
    '{"ingredients": ["chicken", "onion", "tomato", ...]} '
    "Lowercase, English, no brand names, no containers."
)

def _img_to_data_url(img: Image.Image) -> str:
    buf = io.BytesIO()
    # JPEG has no alpha channel, so convert RGBA/palette images first
    img.convert("RGB").save(buf, "JPEG", quality=85)
    b64 = base64.b64encode(buf.getvalue()).decode()
    return f"data:image/jpeg;base64,{b64}"

def identify_ingredients(image: Image.Image) -> list[str]:
    llm = get_vision_model()
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": _img_to_data_url(image)}},
            {"type": "text", "text": PROMPT},
        ]}],
        temperature=0.2,
        response_format={"type": "json_object"},
    )
    data = json.loads(out["choices"][0]["message"]["content"])
    return [s.lower().strip() for s in data["ingredients"]]
```
- Test locally with 5 sample fridge photos.

Deliverable: the function returns a non-empty English list with ≥80% precision on a clean fridge photo.
Verify: stash these 5 results in tests/vision_smoke.json for regression checks.
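
The 80% precision target can be checked mechanically once each sample photo has a hand-labelled ingredient list. A minimal sketch, assuming precision means "share of predicted ingredients that are really in the photo"; the function names and the labelling format are hypothetical.

```python
def precision(predicted: list[str], expected: list[str]) -> float:
    """Fraction of predicted ingredients that appear in the hand-labelled list."""
    pred = {p.lower().strip() for p in predicted}
    truth = {e.lower().strip() for e in expected}
    if not pred:
        return 0.0
    return len(pred & truth) / len(pred)


def passes_vision_check(predicted: list[str], expected: list[str],
                        threshold: float = 0.8) -> bool:
    """Deliverable check: non-empty list with precision at or above the threshold."""
    return bool(predicted) and precision(predicted, expected) >= threshold
```

Exact string matching is strict ("tomatoes" will not match "tomato"), so normalize plurals first if that proves too harsh.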

### Phase 3 - Day 3: Recipe Planner, MiniCPM-V-4 via llama.cpp + retrieval
Goal: given a list of ingredients (and optionally a chosen dish), return a fully structured Recipe JSON including steps, durations, visual descriptions, and nutritional values.
Tasks
- Find or convert MiniCPM-V-4 to GGUF. Likely repo: `openbmb/MiniCPM-V-4-gguf` or community quants. Pick `Q4_K_M`.
- Add a `get_planner_model()` to `src/models/loader.py` (same pattern as vision but without `chat_handler`).
- Write `src/agents/recipe_planner.py`:
  - Step A, propose: call planner with `I have: [ingredients]. Propose 3 dish options that fit. Reply JSON.`
  - Step B, retrieve: for the chosen dish name, call `RecipeIndex.search(...)` and pick the closest match. Use it as a grounded reference.
  - Step C, restructure: prompt the planner with both the user's available ingredients and the retrieved reference recipe, asking it to output the canonical `Recipe` JSON schema below. The retrieval grounds the model and reduces hallucinated steps.
  - Step D, nutrition: from the recipe ingredients, compute approximate nutritional values per serving. See Phase 3.5.
- Define the canonical schema in `src/pipeline.py` using Pydantic:
```python
from typing import Optional

from pydantic import BaseModel, Field

class Step(BaseModel):
    n: int
    instruction: str            # English, imperative
    duration: str               # "4 minutes"
    visual: str                 # English visual description for FLUX prompt
    tip: Optional[str] = None

class Nutrition(BaseModel):
    calories: int               # per serving
    protein_g: float
    carbs_g: float
    fat_g: float
    fiber_g: float

class Recipe(BaseModel):
    name: str
    cuisine: str
    servings: int
    total_time_minutes: int
    # only populated on the "propose" call, so it defaults to empty
    options: list[dict] = Field(default_factory=list)
    ingredients_have: list[str]
    ingredients_missing: list[str]
    substitutes: dict[str, list[str]]
    steps: list[Step]
    final_dish_visual: str
    nutrition_per_serving: Nutrition
```
- Write the system prompt (`src/prompts/planner_system.txt`):
  - Persona: international chef
  - Hard rule: output JSON only, matching schema
  - Hard rule: prefer dishes feasible with available ingredients
  - Hard rule: 5-7 steps, each ≤ 25 words, each with a concrete `visual` field for image generation
  - Hard rule: include `nutrition_per_serving` (model is allowed to estimate; you'll override with `src/data/nutrition.py` for accuracy)
- Use `response_format={"type": "json_object"}` in the chat completion call. Set `temperature=0.7, top_p=0.95` for the propose step (creative) and `temperature=0.4` for the structured-output step (more deterministic). The plan also calls for `enable_thinking=True` on the propose step; that is a chat-template switch, not a documented `create_chat_completion` argument, so confirm how (or whether) your llama-cpp-python build exposes it before relying on it.

Deliverable: for `["chicken","onion","tomato","tortilla","cheese"]` and chosen dish "chicken tinga", the function returns a valid `Recipe` Pydantic object with 5-7 steps.
Verify: the JSON parses, each step has all required fields, and total inference time on Space CPU < 60 seconds.
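
The hard rules in the system prompt (5-7 steps, at most 25 words each, a `visual` per step, nutrition present) can be checked on the raw model output before it reaches Pydantic. This is a sketch of one possible checker, not the plan's prescribed implementation; `check_recipe_rules` is a hypothetical name.

```python
import json


def check_recipe_rules(raw_json: str) -> list[str]:
    """Return a list of rule violations; an empty list means the output is acceptable."""
    try:
        recipe = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(recipe, dict):
        return ["top-level JSON value is not an object"]

    problems = []
    steps = recipe.get("steps", [])
    if not 5 <= len(steps) <= 7:
        problems.append(f"expected 5-7 steps, got {len(steps)}")
    for step in steps:
        n = step.get("n", "?")
        if len(step.get("instruction", "").split()) > 25:
            problems.append(f"step {n}: instruction longer than 25 words")
        if not step.get("visual", "").strip():
            problems.append(f"step {n}: missing visual description")
    if "nutrition_per_serving" not in recipe:
        problems.append("missing nutrition_per_serving")
    return problems
```

A non-empty result is a natural trigger for one retry of the structured-output call with the violations appended to the prompt.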

### Phase 3.5 - Day 3 (afternoon): Nutritional values
Goal: the recipe ends with reliable per-serving nutrition (not hallucinated by the LLM).
Approach: small, embedded reference table beats LLM math.
Tasks
- Bundle `data/nutrition_table.csv`: a 200-row CSV mapping common English ingredient names to per-100g macros (kcal, protein, carbs, fat, fiber). Source: USDA FoodData Central CSV download (free, public domain). Trim columns; commit to repo.
- Write `src/data/nutrition.py`:
  - `parse_quantity(line: str) -> (grams, ingredient_name)`: handle "2 cups flour", "200 g chicken", "1 tbsp olive oil". Use a small regex + a unit-to-grams table (cup=240, tbsp=15, tsp=5, oz=28.35). The volume conversions assume water-like density, so dry goods such as flour will be overestimated.
  - `compute_nutrition(ingredient_lines: list[str], servings: int) -> Nutrition`: sum per-100g values weighted by grams, divide by servings.
  - If a line cannot be parsed, skip it and log; don't crash.
- After the planner returns a recipe, overwrite `recipe.nutrition_per_serving` with the computed value. Keep the LLM's value only as a fallback when the parser yields zero.

Deliverable: for a known recipe (e.g., spaghetti with tomato sauce, 4 servings), computed calories per serving is within ±25% of online references.
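
The two functions described above can be sketched with the standard library alone. This version is illustrative: it tracks only calories and protein, returns a plain dict instead of the `Nutrition` model, and the two table rows are placeholder numbers, NOT real USDA values.

```python
import re

# Unit-to-grams table from the plan; volume units assume water-like density.
UNIT_GRAMS = {"g": 1.0, "cup": 240.0, "cups": 240.0, "tbsp": 15.0, "tsp": 5.0, "oz": 28.35}

# Placeholder per-100g rows (kcal, protein_g); NOT real USDA values.
TABLE = {"flour": (350.0, 10.0), "chicken": (200.0, 25.0)}

LINE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-zA-Z]+)\s+(.+?)\s*$")


def parse_quantity(line: str):
    """Return (grams, ingredient_name), or None when the line cannot be parsed."""
    m = LINE_RE.match(line)
    if not m or m.group(2).lower() not in UNIT_GRAMS:
        return None
    grams = float(m.group(1)) * UNIT_GRAMS[m.group(2).lower()]
    return grams, m.group(3).lower()


def compute_nutrition(lines: list[str], servings: int) -> dict[str, float]:
    """Sum per-100g values weighted by grams, then divide by servings."""
    kcal = protein = 0.0
    for line in lines:
        parsed = parse_quantity(line)
        if parsed is None or parsed[1] not in TABLE:
            print(f"skipped: {line!r}")  # skip-and-log, never crash
            continue
        grams, name = parsed
        kcal += TABLE[name][0] * grams / 100
        protein += TABLE[name][1] * grams / 100
    return {"calories": round(kcal / servings, 1), "protein_g": round(protein / servings, 1)}
```

Unitless lines such as "2 eggs" are skipped by this regex; a per-item weight table would be needed to cover them.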

### Phase 4 - Day 4: Step Illustrator, FLUX.2 Klein 9B

Goal: generate an appetizing image for the final dish + one image per step.

Constraint: FLUX.2 Klein on CPU is impractical; on a free Space CPU it would take ~10 minutes per image. Two paths:
- Path A (recommended for the hackathon): upgrade the Space to a GPU instance (T4 or A10G; paid, but $20 HF credits cover it for a week of development). Code stays unchanged.
- Path B (fallback): run FLUX in `enable_model_cpu_offload()` mode with `num_inference_steps=4` and accept ~3 min/image. This is only feasible for pre-rendered demo recipes, not live runs. Verify it on the target hardware first: diffusers' model CPU offload still expects an accelerator to execute each submodule on.

Tasks
- Add to `requirements.txt`: `diffusers>=0.31`, `transformers>=4.45`, `accelerate`, `torch`, `safetensors`. Confirm that the installed diffusers release actually ships `Flux2KleinPipeline`.
- Write `src/models/illustrator.py`:
```python
import torch
from diffusers import Flux2KleinPipeline

_pipe = None  # lazy singleton

def get_flux():
    global _pipe
    if _pipe is None:
        _pipe = Flux2KleinPipeline.from_pretrained(
            "black-forest-labs/FLUX.2-klein-9B",
            torch_dtype=torch.bfloat16,
        )
        # Model CPU offload needs an accelerator to run each submodule on;
        # without CUDA the pipeline simply stays on the CPU.
        if torch.cuda.is_available():
            _pipe.enable_model_cpu_offload()
    return _pipe

def render(prompt: str, seed: int = 0) -> "PIL.Image.Image":
    pipe = get_flux()
    device = "cuda" if torch.cuda.is_available() else "cpu"
    img = pipe(
        prompt=prompt,
        height=1024,
        width=1024,
        guidance_scale=1.0,
        num_inference_steps=4,
        generator=torch.Generator(device=device).manual_seed(seed),
    ).images[0]
    return img
```
- Write `src/agents/step_illustrator.py`:
  - For each `Step.visual`, build a prompt like: `f"Top-down photo of a kitchen pan or plate showing {visual}. {cuisine} home cooking, warm natural lighting, recipe magazine style, photorealistic, appetizing."`
  - Generate the final dish image first, then the per-step images, all in one Python loop (no parallelism, because FLUX holds the GPU).
  - Cache results on disk keyed by a stable hash of the prompt (e.g. SHA-256; Python's built-in `hash()` is randomized per process for strings) to avoid re-renders on re-runs.
  - Emit Gradio progress updates so the UI doesn't appear frozen.
- Critical tuning: keep `num_inference_steps=4` (Klein is distilled). Higher counts increase latency and offer minimal quality gain at this scale.

Deliverable: for a 5-step recipe, all 6 images (final + 5 steps) render in:
- < 1 minute on T4 GPU Space
- < 8 minutes on CPU offload (acceptable only for pre-cached demos)

Verify: show the 6 images to an unprompted human; ≥4 should be described as "appetizing".
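
The on-disk cache for rendered images can be sketched independently of FLUX. The example below assumes the renderer is passed in as a callable that returns PNG bytes; `render_cached`, `cache_path`, and folding the seed into the key are all illustrative choices, not part of the plan.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cache_path(prompt: str, seed: int, cache_dir: Path) -> Path:
    """Stable file path for a (prompt, seed) pair; SHA-256 is identical across runs."""
    digest = hashlib.sha256(f"{seed}:{prompt}".encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.png"


def render_cached(prompt: str, seed: int, cache_dir: Path,
                  render_fn: Callable[[str, int], bytes]) -> bytes:
    """Return cached PNG bytes if present; otherwise render once and store the result."""
    path = cache_path(prompt, seed, cache_dir)
    if path.exists():
        return path.read_bytes()
    data = render_fn(prompt, seed)  # hypothetical renderer returning PNG bytes
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return data
```

Passing the renderer as an argument keeps the cache testable without loading the model.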

### Phase 5 - Day 5: Narrator, VoxCPM2
Goal: every step's instruction is rendered to an MP3 in a warm, clear English voice.
Tasks
- Confirm the exact VoxCPM2 repo name on HF (`openbmb/VoxCPM2` or similar). Read its README for the inference snippet, since TTS APIs vary widely between models.
- Add to `requirements.txt`: `soundfile`, `torchaudio`, `numpy`. If VoxCPM2 ships GGUF, use it via a `llama-cpp-python` audio extension (if available); otherwise load it via `transformers` directly.
- Write `src/models/narrator.py`:
```python
_tts = None  # lazy singleton

def get_tts():
    global _tts
    if _tts is None:
        # placeholder: replace with the exact VoxCPM2 loading code from its README
        from transformers import AutoModel, AutoProcessor
        _tts = ...  # load on CPU; VoxCPM2 is small (~1B)
    return _tts

def synthesize(text: str, voice: str = "warm_female_en") -> bytes:
    """Returns MP3 bytes."""
    tts = get_tts()
    wav = tts.generate(text, voice=voice)  # API depends on VoxCPM2
    # placeholder: encode wav -> mp3 with soundfile + ffmpeg-python or pydub
    mp3_bytes = ...
    return mp3_bytes
```
- Write `src/agents/narrator.py`:
  - For each step, synthesize `step.instruction`. If `step.tip` is set, synthesize a separate "tip" clip.
  - Save MP3 files in a per-recipe temp directory; return file paths to Gradio.
- Pre-render all step audio when the recipe is finalized. Never stream per-step in the demo (too much UI lag).

Deliverable: clicking "Play" on step 1 in the UI plays clear English narration.
Verify: on a 5-step recipe, total TTS rendering time < 30 seconds on CPU.
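
Because the VoxCPM2 call itself is still a placeholder, the part of the narrator agent that can be pinned down now is the clip plan: which files to produce and from which text. A small sketch under that assumption; the file-naming scheme and the name `plan_clips` are hypothetical.

```python
from pathlib import Path


def plan_clips(steps: list[dict], out_dir: Path) -> list[tuple[Path, str]]:
    """List (output_path, text) pairs: one clip per instruction, plus one per tip."""
    clips = []
    for step in steps:
        n = step["n"]
        clips.append((out_dir / f"step_{n:02d}.mp3", step["instruction"]))
        if step.get("tip"):
            clips.append((out_dir / f"step_{n:02d}_tip.mp3", step["tip"]))
    return clips
```

Each pair can then be handed to `synthesize` and the returned bytes written to the planned path.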

### Phase 6 - Day 6: Gradio UI (Off-Brand)
Goal: the Space looks like a recipe magazine, not stock Gradio.
Tasks
- Write `src/ui/theme.py`:
```python
import gradio as gr

theme = gr.themes.Soft(
    primary_hue="orange",
    neutral_hue="stone",
    font=[gr.themes.GoogleFont("Inter"), "sans-serif"],
    font_mono=[gr.themes.GoogleFont("JetBrains Mono"), "monospace"],
)

CSS = """
.gradio-container { background: #f5ecd9 !important; }
.recipe-hero { background:#fffbf0; border-radius:14px; padding:28px; }
.recipe-hero h1 { font-family:'Lora',serif!important; font-size:36px!important; color:#6b4a2a!important; }
.step-card { background:#fffbf0; border-left:4px solid #a85c2a; border-radius:8px; padding:18px 22px; margin:12px 0; }
.nutri-grid { display:grid; grid-template-columns:repeat(5,1fr); gap:12px; margin-top:24px; }
.nutri-cell { background:#fffbf0; border:1px solid #d8c9ad; border-radius:10px; padding:12px; text-align:center; }
"""
```
- Write `app.py` with three tabs:
  - Tab 1, Cook: fridge photo input → ingredient chips → 3 dish options → selected recipe card with hero image, steps (image + text + audio play button each), nutrition grid at the bottom.
  - Tab 2, Check Progress: upload a progress photo + select active step → validator returns badge (`go`/`wait`/`fix`) + tip + audio.
  - Tab 3, About / Tech: README-style explanation, badges, model list.
- Use `gr.Blocks` with `gr.State` to hold the current `Recipe` across UI events. Serialize to/from `dict` since Pydantic objects don't survive Gradio state by default: return `recipe.model_dump()` from the callback as the state output, and rebuild the object with `Recipe.model_validate(state)` when a later callback needs it.
- Wire callbacks:
  - `btn_propose.click(fn=on_propose, inputs=[fridge_photo], outputs=[ingredient_chips, dish_options, state])`
  - `dish_options.select(fn=on_pick_dish, inputs=[state, picked_dish], outputs=[recipe_card, hero_img, steps_column, nutrition_grid, state])`
  - `progress_image.upload(fn=on_validate, inputs=[state, current_step_idx, progress_image], outputs=[verdict_md, tip_audio])`

Deliverable: end-to-end run from a sample fridge photo to a fully rendered recipe card with audio and nutrition. No Gradio default look anywhere.
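
The `.nutri-grid` and `.nutri-cell` classes defined in the CSS need matching markup. One way to produce it, sketched here as an assumption, is a small pure function whose output is passed to a `gr.HTML` component; the labels and the name `nutrition_grid_html` are illustrative.

```python
from html import escape


def nutrition_grid_html(nutrition: dict[str, float]) -> str:
    """Render the five macros as cells inside a .nutri-grid container."""
    labels = [
        ("calories", "Calories", "kcal"),
        ("protein_g", "Protein", "g"),
        ("carbs_g", "Carbs", "g"),
        ("fat_g", "Fat", "g"),
        ("fiber_g", "Fiber", "g"),
    ]
    cells = "".join(
        f'<div class="nutri-cell"><strong>{escape(str(nutrition[key]))} {unit}</strong>'
        f"<br>{label}</div>"
        for key, label, unit in labels
    )
    return f'<div class="nutri-grid">{cells}</div>'
```

The input dict matches the field names of the `Nutrition` model, so `recipe.nutrition_per_serving.model_dump()` can be passed straight in.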

### Phase 7 - Day 7: Progress Validator (closed loop)
Goal: user uploads a progress photo, app says "go / wait / fix" with a voiced tip.
Tasks
- Write `src/agents/progress_validator.py`:
```python
# Literal JSON braces are doubled so PROMPT.format(instruction=...) works.
PROMPT = """Compare these two cooking photos.
Photo 1 (target): how it should look after the step "{instruction}".
Photo 2 (user's pan/plate): the user's current progress.
Reply strict JSON: {{"verdict": "go|wait|fix", "feedback": "...", "tip": "..."}}
- "go": looks right, move to next step
- "wait": needs more time, do not change anything yet
- "fix": something is off; suggest a concrete adjustment in one sentence
"""

def validate(target_img, user_img, step_instruction):
    ...
```
- Use the same vision model singleton as Phase 2, so both calls share weights.
- Render the verdict as a colored badge (green/amber/red) and play the tip via VoxCPM2.

Deliverable: running the validator on 5 real progress photos returns the correct verdict on ≥3.
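
The validator's reply still has to be parsed and mapped to a badge colour before it reaches the UI. The sketch below shows one defensive way to do that; falling back to "wait" on malformed output is an assumption made here, not a rule from the plan, and `parse_verdict` is a hypothetical name.

```python
import json

BADGE_COLORS = {"go": "green", "wait": "amber", "fix": "red"}


def parse_verdict(raw: str) -> dict[str, str]:
    """Parse the validator reply; fall back to a cautious 'wait' on malformed output."""
    fallback = {"verdict": "wait", "color": BADGE_COLORS["wait"],
                "tip": "Could not read the photo check. Please try another photo."}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict):
        return fallback
    verdict = str(data.get("verdict", "")).lower().strip()
    if verdict not in BADGE_COLORS:
        return fallback
    return {"verdict": verdict, "color": BADGE_COLORS[verdict], "tip": str(data.get("tip", ""))}
```

Choosing "wait" as the fallback avoids telling the user to move on when the model's answer could not be read.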

### Phase 8 - Day 8: Fine-tune the Planner on the Kaggle dataset (Well-Tuned badge)

Important caveat: The user instruction says "for now keep inference on llama.cpp inside HF Space, future migration to Modal." Fine-tuning still requires GPU, so training itself happens on Modal (one-shot, offline) or on a rented Colab/Lambda GPU. Inference of the resulting model stays on llama.cpp inside the Space (as GGUF). This does not violate the runtime constraint: only the build pipeline touches a GPU.
Goal: publish a fine-tuned Planner GGUF to the Hub and load it from the Space.
Tasks
- Build SFT dataset (`scripts/build_sft_dataset.py`):
  - Load Kaggle `better-recipes` dataset.
  - For each recipe, build a `(prompt, completion)` pair where `prompt` is `"Available ingredients: X, Y, Z. Propose recipe."` and `completion` is the full canonical `Recipe` JSON.
  - Generate ~1000 pairs, push to `<you>/cook-with-me-sft` HF Dataset.
- LoRA training (`scripts/train_planner.py`, to be run on a GPU machine, not the Space):
```python
# peft + trl SFTTrainer, base = openbmb/MiniCPM-V-4
# r=16, alpha=32, lr=2e-4, epochs=2, batch=4
# push_to_hub=True, hub_model_id="<you>/cook-with-me-planner-4b"
```
- Convert to GGUF (Day 8 evening):
  - Use `llama.cpp/convert_hf_to_gguf.py` then `quantize` (named `llama-quantize` in current llama.cpp builds) to `Q4_K_M`.
  - Push GGUF to `<you>/cook-with-me-planner-4b-gguf`.
- Update `src/models/loader.py` to point at your GGUF instead of the base model.

Deliverable: the Space loads your fine-tuned Planner GGUF and produces JSON recipes that are noticeably better-formatted than the base model on a held-out test set.
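
Building one `(prompt, completion)` pair from a normalized recipe is a pure function, which makes the SFT dataset script easy to test before any GPU time is spent. A sketch, assuming each recipe is already a dict shaped like the canonical `Recipe` schema; `build_sft_pair` is a hypothetical name.

```python
import json


def build_sft_pair(recipe: dict) -> dict[str, str]:
    """Turn one normalized recipe into a prompt/completion training example."""
    ingredients = ", ".join(recipe["ingredients_have"])
    prompt = f"Available ingredients: {ingredients}. Propose recipe."
    # sort_keys keeps the completion byte-identical across runs for the same recipe.
    completion = json.dumps(recipe, ensure_ascii=False, sort_keys=True)
    return {"prompt": prompt, "completion": completion}
```

A list of these dicts can be written out as JSON Lines and pushed to the dataset repo.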

### Phase 9 - Day 9: End-to-end test, performance pass, pre-warm cache
Goal: the Space loads in <60s and a full recipe (text + 5 images + 5 audios + nutrition) renders in <2 minutes on the chosen hardware.
Tasks
- Write `scripts/smoke_test.py` that runs the full pipeline on 3 sample fridge photos and asserts:
  - Each ingredient list is non-empty
  - Each recipe has 5-7 steps
  - Each step has a non-empty image and audio path
  - Nutrition has all 5 macros set
- Implement on-disk caching for FLUX outputs (key = SHA256 of prompt) so re-runs of the same recipe are instant. Save to `~/.cache/cook-with-me/flux/`.
- Pre-render and commit 3 fully-prepared demo recipes (chicken tinga, pasta carbonara, chicken tikka) so judges see results in <5s on first click.
- Add error handling at every UI boundary: a model failure should display a friendly message, not a stack trace.
- Add a "Loading models..." progress bar on first request, since the first cold start can take 90s.

Deliverable: smoke test passes on the live Space.
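
The four smoke-test assertions can be collected by one helper that reports every failure at once instead of stopping at the first. The result-dict shape used below (`ingredients`, `steps` with `image_path` and `audio_path`, `nutrition`) is an assumption for illustration; adapt it to whatever the pipeline actually returns.

```python
MACROS = ("calories", "protein_g", "carbs_g", "fat_g", "fiber_g")


def smoke_failures(result: dict) -> list[str]:
    """Collect failed smoke-test conditions for one pipeline run."""
    failures = []
    if not result.get("ingredients"):
        failures.append("ingredient list is empty")
    steps = result.get("steps", [])
    if not 5 <= len(steps) <= 7:
        failures.append(f"step count {len(steps)} outside 5-7")
    for i, step in enumerate(steps, start=1):
        if not step.get("image_path"):
            failures.append(f"step {i}: no image path")
        if not step.get("audio_path"):
            failures.append(f"step {i}: no audio path")
    nutrition = result.get("nutrition", {})
    missing = [m for m in MACROS if nutrition.get(m) is None]
    if missing:
        failures.append(f"nutrition missing: {', '.join(missing)}")
    return failures
```

In `scripts/smoke_test.py` this would be called once per sample photo, with `assert not failures, failures` at the end.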

### Phase 10 - Day 10: README, demo video, social post, submit
Tasks
1. Write README.md with the required HF Space frontmatter:
```yaml
---
title: Cook With Me
emoji: 🍲
colorFrom: orange
colorTo: yellow
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: apache-2.0
---
```
Followed by:
- One-paragraph pitch
- 60-second demo video embed
- Architecture diagram (export from `arquitectura.html` as PNG)
- Section: "How closed-loop visual cooking guidance works"
- Models used (with HF links + total parameter count)
- Badges declared
- Build / run instructions
2. Record a 60-90 second demo video: real person cooks a recipe end-to-end with the app guiding via voice, ending with the cooked plate on camera.
3. Write the Field Notes blog post: one of the engineering surprises (e.g., "FLUX.2 step images at 4 steps look better than 8: here's why" or "Closed-loop validation needs the same vision model on both sides").
4. Social post on X / LinkedIn with the demo video.
5. Submit on the hackathon platform.
---
## 4. Tools usage matrix (when to reach for what)
| Phase | Primary tools | Why |
|---|---|---|
| 0: setup | HF CLI, Kaggle CLI, OpenAI Codex CLI | one-shot config |
| 1: data | `kagglehub`, `pandas`, `sentence-transformers` | offline dataset prep |
| 2: vision | `llama-cpp-python` + `MiniCPMv26ChatHandler` | runs inside Space, badge: Llama Champion |
| 3: planner | `llama-cpp-python` + retrieval over local parquet | grounded JSON output |
| 3.5: nutrition | local CSV + regex parser | reliable, no LLM math |
| 4: illustrator | `diffusers` + `Flux2KleinPipeline` | sponsor model showcase |
| 5: narrator | VoxCPM2 via `transformers` (or its native API) | local TTS |
| 6: UI | `gradio` + custom CSS theme | Off-Brand badge |
| 7: validator | same vision singleton as phase 2 | closed-loop innovation, Best Agent |
| 8: fine-tune | `peft`, `trl`, `llama.cpp` convert/quantize, on a GPU machine | Well-Tuned badge |
| 9: test/cache | `pytest`, `hashlib`, on-disk FLUX cache | demo reliability |
| 10: submit | HF Spaces, video tool, social | shipping |
---
## 5. Performance budget on the HF Space
| Operation | Target latency | Hardware needed |
|---|---|---|
| Vision: ingredient ID | < 8 s | CPU 4-thread |
| Planner: propose 3 dishes | < 12 s | CPU 4-thread |
| Planner: build full recipe JSON | < 20 s | CPU 4-thread |
| Nutrition computation | < 0.1 s | CPU |
| FLUX: 1 image (4 steps) | < 12 s on T4 / < 90 s on CPU offload | GPU strongly recommended |
| FLUX: 6 images (final + 5 steps) | < 80 s on T4 | GPU |
| VoxCPM2: 1 step narration | < 5 s | CPU |
| Validator: 1 progress check | < 8 s | CPU |
| **Full recipe end-to-end** | **< 2 min on T4 Space** | n/a |
**Hardware decision:** rent a T4 Space (~$0.40/hr) for the demo week. The $20 HF credits cover ~50 hours.
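
The two figures behind the hardware decision are plain arithmetic and can be re-derived if the hourly rate or per-image target changes. The helper names below are hypothetical; the inputs are the numbers stated in this section.

```python
def credit_hours(credits_usd: float, rate_usd_per_hour: float) -> float:
    """Hours of GPU time a credit balance buys at a flat hourly rate."""
    return credits_usd / rate_usd_per_hour


def batch_seconds(images: int, seconds_per_image: float) -> float:
    """Sequential render time for a batch (FLUX renders one image at a time)."""
    return images * seconds_per_image


if __name__ == "__main__":
    print(credit_hours(20.0, 0.40))  # hours covered by the $20 credit
    print(batch_seconds(6, 12.0))    # 6 images at the per-image T4 target
```

At the per-image ceiling of 12 s, six sequential images stay inside the 80 s batch target.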
---
## 6. Risks and mitigations (delta from `estrategia.md`)
| Risk | Mitigation |
|---|---|
| MiniCPM-V-4 has no public GGUF | Convert yourself with `llama.cpp/convert_hf_to_gguf.py`. Allow a half-day buffer in Phase 2. |
| llama-cpp-python's MiniCPM-V chat handler version mismatch | Require `llama-cpp-python>=0.3.2` (matching `requirements.txt`); test the handler import on Day 2. If it fails, fall back to MiniCPM-V-2.6 GGUF (well-supported) for vision and document the swap. |
| FLUX.2 Klein 9B too slow on free CPU Space | Upgrade to a paid GPU Space (~$10 for the demo week). Document this in the README so judges expect it. |
| VoxCPM2 docs sparse | Drop to Kokoro-82M or Piper TTS as a backup. Lose the OpenBMB voice angle but keep the audio. |
| Kaggle dataset has format quirks (HTML in instructions, missing fields) | The Phase 1 normalization step handles this; budget 2 hours. |
| Nutrition CSV missing exotic ingredients | Skip-and-log strategy already designed; demo-day recipes use common ingredients only. |
| Total params >32B if VoxCPM2 turns out to be 7B | Check size in Phase 0; if too large, drop to a smaller TTS. |
---
## 7. "Day-1 hello world" checklist
Before writing any agent code, get this minimal end-to-end loop working; it proves your stack:
1. [ ] Empty Gradio Space deployed, shows "Hello"
2. [ ] `huggingface-cli login` works locally
3. [ ] `kaggle datasets download thedevastator/better-recipes-for-a-better-life` succeeds
4. [ ] `from llama_cpp import Llama` runs in your venv
5. [ ] Download one tiny GGUF (e.g., TinyLlama Q4) and call it from a Gradio textbox round-trip
6. [ ] Push the round-trip to the Space; confirm it answers in the cloud
**Only after all 6 are checked, start Phase 1.**
---
## 8. Where this plan differs from `estrategia.md` (deltas to communicate)
| Topic | `estrategia.md` (Spanish, Mexican-cuisine focus) | This document (current requirements) |
|---|---|---|
| Language | Spanish-first | **English only** |
| Cuisine | Mexican | **International** (Kaggle dataset) |
| Voice models | OpenBMB voice + Cohere Labs | **VoxCPM2** only (single voice) |
| Vision model | MiniCPM-V 2.6 / 4 | **MiniCPM-V-4.6** |
| Reasoning model | MiniCPM-4 4B | **MiniCPM-V-4** |
| FLUX runtime | Modal endpoint | **Inside Space (llama.cpp principle)**; Modal kept as a future migration target only |
| External APIs at runtime | Allowed (Modal, OpenAI optional) | **None**: full local inference inside Space |
| Nutritional info | Not specified | **Required** at end of recipe |
| Fine-tune dataset | 200 synthetic Mexican recipes | **Kaggle better-recipes (international)** |
If anything in `plan.md` or `estrategia.md` conflicts with this document, **this document wins** β it reflects the latest user requirements.
---
## 9. Definition of done
The implementation is complete when **all** of these are true:
- [ ] Public HF Space `https://huggingface.co/spaces/<you>/cook-with-me` loads
- [ ] App is fully in English
- [ ] Fridge photo → ingredient list → 3 dish options → full recipe with images, audio, and nutrition works end-to-end
- [ ] Progress validator returns sensible verdicts on 3+ test photos
- [ ] All inference (vision, planner, TTS) runs through llama.cpp / local diffusers, with **no external API calls at runtime**
- [ ] Total parameters declared in README ≤ 32B
- [ ] Fine-tuned Planner GGUF published to HF Hub (Well-Tuned badge)
- [ ] Demo video (60-90s) recorded with a real person cooking
- [ ] Field Notes blog post published
- [ ] Submitted on the hackathon platform before deadline