# Model Card
## Status
The project ships a stable local baseline, live MiniCPM-V vision on the hosted Space, one published text LoRA v2 adapter, and one published Q4_K_M GGUF. The public Gradio Space defaults to real MiniCPM-V object understanding with deterministic mock text. The GGUF has passed a local llama.cpp smoke test, but it has not been switched into the live Space runtime.
Local development defaults to deterministic mock backends. The hosted Space runs MiniCPM-V 2.6 vision on ZeroGPU with a hidden non-secret probe for diagnostics. Text generation has optional llama.cpp wiring for an externally configured GGUF model via `TEXT_MODEL_PATH`, but the live Space keeps text on the mock runtime for this release. A Modal LoRA v2 training run has completed. The resulting adapter is published at `https://huggingface.co/qqyule/objectverse-diary-qwen15b-lora`, and the merged Q4_K_M GGUF is published in the same repo.
Hosted MiniCPM-V validation passed after adding an `HF_TOKEN` Space secret with access to the gated `openbmb/MiniCPM-V-2_6` model. The validation uses public mug, keyboard, and shoe images on ZeroGPU, while text generation intentionally remains mock. See `docs/SPACE_VLM_REPORT.md`.
## Planned Components
- Vision understanding: MiniCPM-V or lightweight fallback VLM.
- Text generation: fine-tuned small LLM.
- Runtime: llama.cpp / llama-cpp-python.
## Candidate Architecture
| Component | Candidate | Notes |
| --- | --- | --- |
| Vision | `openbmb/MiniCPM-V-2_6` or mock fallback | Live Space uses MiniCPM-V on ZeroGPU; local runtime can still default to mock. |
| Text | deterministic mock text by default; published `Qwen/Qwen2.5-1.5B-Instruct` LoRA v2 Q4_K_M GGUF for local runtime | Adapter and GGUF published; Space text runtime remains mock for the live vision release. |
| Runtime | optional GGUF through llama.cpp / llama-cpp-python | Wired with mock fallback; local GGUF smoke passed on 2026-06-08. |
| UI | Gradio Blocks | Required by the hackathon and project rules. |
## Parameter Budget
Total model parameters must remain <= 32B.
Record final numbers here before submission:
| Component | Model | Parameters | Counted Toward Total |
| --- | --- | ---: | --- |
| Vision | MiniCPM-V 2.6 optional path | ~8B | yes, when enabled |
| Text base | Stable baseline mock text | 0 | no model parameters |
| Optional text base | `Qwen/Qwen2.5-1.5B-Instruct` | ~1.5B | yes, when enabled |
| Published LoRA v2 GGUF | `qqyule/objectverse-diary-qwen15b-lora` / `objectverse-diary-qwen15b-lora-v2-q4_k_m.gguf` | ~1.5B base, quantized file | yes, if enabled |
| Published LoRA adapter | `qqyule/objectverse-diary-qwen15b-lora` | small adapter over base model | yes, when enabled |
| Live Space total | MiniCPM-V vision + mock text | ~8B | active total, within the <= 32B budget |
If the optional MiniCPM-V 2.6 vision path (~8B) and the optional Qwen 1.5B text base (~1.5B) are both enabled, the expected total is about 9.5B plus a small LoRA adapter, well under the 32B project budget.
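The budget arithmetic can be checked with a few lines of Python. This is a minimal sketch: the component keys are hypothetical names chosen for illustration, the counts are the approximate figures from the table, and the small LoRA adapter is left out because the card does not give its size.

```python
# Approximate parameter counts in billions, copied from the table above.
# The dictionary keys are illustrative names, not identifiers from the project.
BUDGET_B = 32.0

COMPONENTS_B = {
    "vision_minicpm_v_2_6": 8.0,  # ~8B, optional vision path
    "text_qwen2_5_1_5b": 1.5,     # ~1.5B, optional text base
    "text_mock": 0.0,             # deterministic mock text, no model parameters
}


def total_params_b(enabled):
    """Sum the approximate parameter counts (in billions) of enabled components."""
    return sum(COMPONENTS_B[name] for name in enabled)


def within_budget(enabled, budget_b=BUDGET_B):
    """Return True when the enabled components fit inside the budget."""
    return total_params_b(enabled) <= budget_b


live_space = ["vision_minicpm_v_2_6", "text_mock"]
full_stack = ["vision_minicpm_v_2_6", "text_qwen2_5_1_5b"]

print(total_params_b(live_space), within_budget(live_space))
print(total_params_b(full_stack), within_budget(full_stack))
```

Both configurations leave more than 20B of headroom, so the unlisted adapter size does not change the conclusion.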
## Intended Inputs And Outputs
Inputs:
- user-uploaded everyday object photo
- optional object description
- personality mode
Outputs:
- structured object understanding JSON
- hidden object persona JSON
- short English-first diary with Chinese helper text
- object chat response
- share card preview
- anonymized trace record
## Dataset Notes
Dataset planning lives in `docs/DATASET.md`.
Current preview data is deterministic and mock-generated. It should only be used for schema validation, dry-run validation, and workflow planning until real candidate samples are generated and curated.
The Modal training scaffold defaults to `Qwen/Qwen2.5-1.5B-Instruct` and saves adapter artifacts to a Modal Volume. `data/train/objectverse_sft_curated_v2.jsonl` contains 200 synthetic curated rows covering 40 everyday objects and 5 personality modes. It is published at `https://huggingface.co/datasets/qqyule/objectverse-diary-sft-curated` as `objectverse_sft_curated_v2.jsonl`.
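The stated figures (200 rows, 40 objects, 5 personality modes) are consistent with one row per object and mode pair, although this card does not say so explicitly. A minimal sketch of a coverage check follows. The field names `object` and `personality_mode` are hypothetical placeholders, since the actual schema is documented in `docs/DATASET.md`, and the demo rows are synthetic stand-ins rather than real dataset content.

```python
import json
from collections import Counter


def summarize_sft_rows(lines, object_key="object", mode_key="personality_mode"):
    """Count rows, distinct objects, and distinct modes in JSONL text lines.

    The key names are hypothetical placeholders, not the confirmed schema.
    """
    rows = [json.loads(line) for line in lines if line.strip()]
    objects = Counter(row[object_key] for row in rows)
    modes = Counter(row[mode_key] for row in rows)
    return {"rows": len(rows), "objects": len(objects), "modes": len(modes)}


def matches_card(summary, rows=200, objects=40, modes=5):
    """Compare a summary against the numbers stated in this card."""
    return (summary["rows"], summary["objects"], summary["modes"]) == (rows, objects, modes)


# Synthetic stand-in data: 40 objects x 5 modes = 200 rows.
demo = [
    json.dumps({"object": f"object-{i}", "personality_mode": f"mode-{j}"})
    for i in range(40)
    for j in range(5)
]
summary = summarize_sft_rows(demo)
print(summary, matches_card(summary))
```

To run it against the real file, pass the lines of `data/train/objectverse_sft_curated_v2.jsonl` and adjust the two key names to the actual schema.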
Published adapter:
```text
https://huggingface.co/qqyule/objectverse-diary-qwen15b-lora
```
Current v2 training run summary:
- Platform: Modal
- Run name: `objectverse-diary-qwen15b-lora-v2`
- Base model: `Qwen/Qwen2.5-1.5B-Instruct`
- Dataset: 200 synthetic curated v2 rows
- Train / eval rows: 180 / 20
- Steps: 120
- Max sequence length: 1536
- Learning rate: 0.0001
- Effective batch size: 8
- LoRA rank / alpha / dropout: 32 / 64 / 0.05
- Assistant-output-only loss: enabled
- Train loss: 0.3240
- Eval loss: 0.0162
- Epoch: 5.2222
- GGUF conversion: completed with pinned `llama.cpp` commit `8f83d6c271d194bde2d410145a0ce73bc42e85cd`
- Published GGUF: `objectverse-diary-qwen15b-lora-v2-q4_k_m.gguf`
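A few derived quantities follow directly from the run summary above. The sketch below is illustrative only: the dictionary keys and helper names are hypothetical, and the card does not state how the effective batch size of 8 was split between per-device batch size and gradient accumulation, so the last helper simply lists the single-device possibilities.

```python
# Hyperparameters copied from the v2 run summary above.
RUN = {
    "train_rows": 180,
    "eval_rows": 20,
    "lora_rank": 32,
    "lora_alpha": 64,
    "lora_dropout": 0.05,
    "effective_batch_size": 8,
}


def lora_scale(rank, alpha):
    """Standard LoRA scaling applied to the low-rank update: alpha / rank."""
    return alpha / rank


def eval_fraction(train_rows, eval_rows):
    """Share of all rows held out for evaluation."""
    return eval_rows / (train_rows + eval_rows)


def batch_splits(effective_batch_size):
    """All (per_device_batch, grad_accum_steps) pairs for a single device."""
    return [
        (b, effective_batch_size // b)
        for b in range(1, effective_batch_size + 1)
        if effective_batch_size % b == 0
    ]


print(lora_scale(RUN["lora_rank"], RUN["lora_alpha"]))
print(eval_fraction(RUN["train_rows"], RUN["eval_rows"]))
print(batch_splits(RUN["effective_batch_size"]))
```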
GGUF smoke status:
- Repo: `qqyule/objectverse-diary-qwen15b-lora`
- File: `objectverse-diary-qwen15b-lora-v2-q4_k_m.gguf`
- Local helper: `scripts/check_llama_cpp_smoke.py`
- Local result: passed on 2026-06-08 with `llama-cpp text generation`, no `text-fallback-to-mock`, schema-valid persona and diary, and non-empty chat reply.
- Space result: not run; do not claim live Space text runtime until a separate Space validation passes.
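The pass criteria listed for the local result can be expressed as a small check. This is a sketch under stated assumptions, not the contents of `scripts/check_llama_cpp_smoke.py`: the shape of the `result` dictionary (`notes`, `persona_valid`, `diary_valid`, `chat_reply`) is hypothetical and chosen only to mirror the four criteria.

```python
def check_smoke(result):
    """Return a list of problems for a smoke result; an empty list means pass.

    The result layout is a hypothetical stand-in for the real trace format.
    """
    problems = []
    notes = result.get("notes", [])
    if "llama-cpp text generation" not in notes:
        problems.append("llama.cpp backend marker missing")
    if "text-fallback-to-mock" in notes:
        problems.append("text generation fell back to mock")
    if not result.get("persona_valid", False):
        problems.append("persona failed schema validation")
    if not result.get("diary_valid", False):
        problems.append("diary failed schema validation")
    if not result.get("chat_reply", "").strip():
        problems.append("chat reply is empty")
    return problems


passing = {
    "notes": ["llama-cpp text generation"],
    "persona_valid": True,
    "diary_valid": True,
    "chat_reply": "Hello from the mug.",
}
fell_back = dict(passing, notes=["text-fallback-to-mock"], chat_reply="  ")

print(check_smoke(passing))
print(check_smoke(fell_back))
```

Returning the full list of problems rather than a single boolean makes it easier to see which criterion failed when a later Space validation is attempted.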
## Safety And Privacy
- Do not use OpenAI, Anthropic, Gemini, Cohere, or other commercial model APIs.
- Do not publish private user photos or unconsented personal data.
- Do not include tokens, credit codes, emails, serial numbers, or credentials.
- Keep raw private traces out of public datasets.
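A pre-publish scan can catch some of the items above before a trace or dataset row is released. The sketch below is a heuristic illustration, not the project's actual tooling: the function and pattern names are hypothetical, the email pattern is deliberately loose, and the token pattern only assumes that Hugging Face user access tokens start with `hf_` followed by a long alphanumeric string. It does not detect serial numbers, credit codes, or other credential formats.

```python
import re

# Heuristic patterns only; a clean result does not prove the text is safe.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "hf_token": re.compile(r"\bhf_[A-Za-z0-9]{20,}\b"),
}


def find_sensitive(text):
    """Return the sorted names of all patterns that match the text."""
    return sorted(name for name, pattern in PATTERNS.items() if pattern.search(text))


print(find_sensitive("A chipped blue mug on a wooden desk."))
print(find_sensitive("Contact owner@example.com about the mug."))
print(find_sensitive("token=hf_" + "a" * 30))
```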
## Fallback Behavior
- If VLM loading fails, use manual description and stable example flow.
- If llama.cpp is not installed, `TEXT_MODEL_PATH` is missing, model loading fails, or output JSON is invalid, keep deterministic mock text fallback for demo safety.
- If model JSON is invalid, repair and validate before rendering.
- Runtime traces do not record literal `TEXT_MODEL_PATH`; they only record that an external GGUF path is configured.
- Hosted VLM validation evidence is preserved in `data/traces/space-vlm/`. These traces use real MiniCPM-V object understanding plus mock text generation and should not be described as full real-text-runtime traces.
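The text fallback rules and the trace redaction rule above can be sketched as follows. This is an illustration under stated assumptions rather than the project's implementation: all function names, the `required_keys` check, and the `text_model_path_configured` trace key are hypothetical, and the JSON repair step is omitted, so invalid output goes straight to the mock payload here.

```python
import json
import os


def choose_text_backend(env=None, llama_available=False, path_exists=os.path.exists):
    """Pick 'llama-cpp' only when the library and a GGUF path are both usable."""
    env = os.environ if env is None else env
    path = env.get("TEXT_MODEL_PATH", "")
    if not llama_available or not path or not path_exists(path):
        return "mock"
    return "llama-cpp"


def parse_or_fallback(raw_output, required_keys, mock_payload):
    """Parse model JSON and return (payload, used_fallback)."""
    try:
        payload = json.loads(raw_output)
    except (TypeError, ValueError):
        return mock_payload, True
    if not isinstance(payload, dict) or not all(k in payload for k in required_keys):
        return mock_payload, True
    return payload, False


def trace_entry(env=None):
    """Record only whether a GGUF path is configured, never the path itself."""
    env = os.environ if env is None else env
    return {"text_model_path_configured": bool(env.get("TEXT_MODEL_PATH"))}


mock = {"diary": "mock diary"}
print(choose_text_backend(env={}, llama_available=True))
print(parse_or_fallback('{"diary": "real diary"}', ["diary"], mock))
print(parse_or_fallback("{not json", ["diary"], mock))
print(trace_entry({"TEXT_MODEL_PATH": "/private/models/model.gguf"}))
```

Passing `env` and `path_exists` as arguments keeps the selection logic testable without touching the real environment or filesystem.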
## Required Notes
- Total model parameter count must remain <= 32B.
- No commercial model APIs.
- Fallback behavior must be documented.
- Dataset provenance and privacy rules must be documented before release.