Model Card

Status

This release consists of a stable local baseline, live MiniCPM-V vision in the Space, one published text LoRA v2 adapter, and one published Q4_K_M GGUF. The public Gradio Space defaults to real MiniCPM-V object understanding with deterministic mock text. The GGUF has passed a local llama.cpp smoke test but has not been switched into the live Space runtime.

Local development defaults to deterministic mock backends. The hosted Space runs MiniCPM-V 2.6 vision on ZeroGPU with a hidden non-secret probe for diagnostics. Text generation has optional llama.cpp wiring for an externally configured GGUF model via TEXT_MODEL_PATH, but the live Space keeps text on the mock runtime for this release. A Modal LoRA v2 run completed, the adapter is published at https://huggingface.co/qqyule/objectverse-diary-qwen15b-lora, and the merged Q4_K_M GGUF is published in the same repo.
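
The backend choice described above can be sketched as a small selection function. This is only an illustration: `select_text_backend` and the returned labels are hypothetical names, and the project's actual wiring may check more conditions (for example, whether llama-cpp-python is importable).

```python
import os
from pathlib import Path


def select_text_backend(env=None):
    """Return "llama_cpp" only when TEXT_MODEL_PATH points at an existing file.

    Hypothetical helper: in every other case the deterministic mock
    backend is kept, matching the mock-by-default behavior described above.
    """
    env = os.environ if env is None else env
    model_path = env.get("TEXT_MODEL_PATH", "").strip()
    if not model_path:
        return "mock"  # nothing configured
    if not Path(model_path).is_file():
        return "mock"  # configured, but the GGUF file is missing
    return "llama_cpp"


if __name__ == "__main__":
    print(select_text_backend())
```

Passing the environment as an argument keeps the function easy to test without mutating `os.environ`.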

Hosted MiniCPM-V validation passed after adding an HF_TOKEN Space secret with access to the gated openbmb/MiniCPM-V-2_6 model. The validation uses public mug, keyboard, and shoe images on ZeroGPU, while text generation intentionally remains mock. See docs/SPACE_VLM_REPORT.md.

Planned Components

  • Vision understanding: MiniCPM-V or lightweight fallback VLM.
  • Text generation: fine-tuned small LLM.
  • Runtime: llama.cpp / llama-cpp-python.

Candidate Architecture

| Component | Candidate | Notes |
| --- | --- | --- |
| Vision | openbmb/MiniCPM-V-2_6 or mock fallback | Live Space uses MiniCPM-V on ZeroGPU; local runtime can still default to mock. |
| Text | deterministic mock text by default; published Qwen/Qwen2.5-1.5B-Instruct LoRA v2 Q4_K_M GGUF for local runtime | Adapter and GGUF published; Space text runtime remains mock for the live vision release. |
| Runtime | optional GGUF through llama.cpp / llama-cpp-python | Wired with mock fallback; local GGUF smoke passed on 2026-06-08. |
| UI | Gradio Blocks | Required by the hackathon and project rules. |

Parameter Budget

Total model parameters must remain <= 32B.

Record final numbers here before submission:

| Component | Model | Parameters | Counted Toward Total |
| --- | --- | --- | --- |
| Vision | MiniCPM-V 2.6 optional path | ~8B | yes, when enabled |
| Text base | Stable baseline mock text | 0 | no model parameters |
| Optional text base | Qwen/Qwen2.5-1.5B-Instruct | ~1.5B | yes, when enabled |
| Published LoRA v2 GGUF | qqyule/objectverse-diary-qwen15b-lora / objectverse-diary-qwen15b-lora-v2-q4_k_m.gguf | ~1.5B base, quantized file | yes, if enabled |
| Published LoRA adapter | qqyule/objectverse-diary-qwen15b-lora | small adapter over base model | yes, when enabled |
| Live Space total | MiniCPM-V vision + mock text | ~8B active model parameters | <= 32B |

If the optional MiniCPM-V 2.6 vision path and the optional Qwen 1.5B text base are both enabled, the expected total is about 9.5B (~8B + ~1.5B) plus a small LoRA adapter, well under the 32B project budget.
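
The budget arithmetic can be checked with a few lines of Python. The component keys below are made-up labels, the counts are the approximate figures from this card, and the LoRA adapter is left out because its exact size is not recorded here.

```python
BUDGET_B = 32.0  # project limit, in billions of parameters

# Approximate parameter counts from this card, in billions.
# The dictionary keys are illustrative labels, not project identifiers.
COMPONENTS_B = {
    "vision_minicpm_v_2_6": 8.0,
    "text_qwen2_5_1_5b": 1.5,
}


def total_params_b(enabled):
    """Sum the approximate parameter counts of the enabled components."""
    return sum(COMPONENTS_B[name] for name in enabled)


live_space_b = total_params_b(["vision_minicpm_v_2_6"])  # mock text adds 0
both_enabled_b = total_params_b(["vision_minicpm_v_2_6", "text_qwen2_5_1_5b"])

print(f"live Space: ~{live_space_b}B, both enabled: ~{both_enabled_b}B")
print("within budget:", both_enabled_b <= BUDGET_B)
```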

Intended Inputs And Outputs

Inputs:

  • user-uploaded everyday object photo
  • optional object description
  • personality mode

Outputs:

  • structured object understanding JSON
  • hidden object persona JSON
  • short English-first diary with Chinese helper text
  • object chat response
  • share card preview
  • anonymized trace record

Dataset Notes

Dataset planning lives in docs/DATASET.md.

Current preview data is deterministic and mock-generated. It should only be used for schema validation, dry-run validation, and workflow planning until real candidate samples are generated and curated.

The Modal training scaffold defaults to Qwen/Qwen2.5-1.5B-Instruct and saves adapter artifacts to a Modal Volume. data/train/objectverse_sft_curated_v2.jsonl contains 200 synthetic curated rows covering 40 everyday objects and 5 personality modes. It is published at https://huggingface.co/datasets/qqyule/objectverse-diary-sft-curated as objectverse_sft_curated_v2.jsonl.
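
A coverage check over the JSONL file might look like the sketch below. The field names `object` and `personality_mode` are assumptions, since the row schema is not shown in this card, and the sample values are invented for the demo. Note that 40 objects times 5 modes equals 200, which would match one row per pair, although the card does not state that layout.

```python
import json
from collections import Counter


def coverage(jsonl_lines, object_key="object", mode_key="personality_mode"):
    """Count rows, distinct objects, distinct modes, and distinct pairs.

    The key names are hypothetical defaults; pass the real field names
    used by the dataset.
    """
    pairs = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        pairs[(row[object_key], row[mode_key])] += 1
    return {
        "rows": sum(pairs.values()),
        "objects": len({obj for obj, _ in pairs}),
        "modes": len({mode for _, mode in pairs}),
        "pairs": len(pairs),
    }


if __name__ == "__main__":
    # Tiny invented sample: 2 objects x 2 modes.
    sample = [
        json.dumps({"object": o, "personality_mode": m})
        for o in ("mug", "shoe")
        for m in ("calm", "dramatic")
    ]
    print(coverage(sample))
```

For the real file, the same function can be fed an open file handle, for example `coverage(open(path, encoding="utf-8"))`.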

Published adapter:

https://huggingface.co/qqyule/objectverse-diary-qwen15b-lora

Current v2 training run summary:

  • Platform: Modal
  • Run name: objectverse-diary-qwen15b-lora-v2
  • Base model: Qwen/Qwen2.5-1.5B-Instruct
  • Dataset: 200 synthetic curated v2 rows
  • Train / eval rows: 180 / 20
  • Steps: 120
  • Max sequence length: 1536
  • Learning rate: 0.0001
  • Effective batch size: 8
  • LoRA rank / alpha / dropout: 32 / 64 / 0.05
  • Assistant-output-only loss: enabled
  • Train loss: 0.3240
  • Eval loss: 0.0162
  • Epoch: 5.2222
  • GGUF conversion: completed with pinned llama.cpp commit 8f83d6c271d194bde2d410145a0ce73bc42e85cd
  • Published GGUF: objectverse-diary-qwen15b-lora-v2-q4_k_m.gguf
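
For reference, the hyperparameters above can be captured in a small config object. `LoraRunConfig` is a hypothetical container, not the project's training code. The `lora_scaling` property assumes the standard LoRA formulation, in which the low-rank update is scaled by alpha divided by rank.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraRunConfig:
    """Hypothetical record of the v2 run settings listed in this card."""

    base_model: str = "Qwen/Qwen2.5-1.5B-Instruct"
    train_rows: int = 180
    eval_rows: int = 20
    max_steps: int = 120
    max_seq_len: int = 1536
    learning_rate: float = 1e-4
    effective_batch_size: int = 8
    lora_rank: int = 32
    lora_alpha: int = 64
    lora_dropout: float = 0.05

    @property
    def lora_scaling(self):
        # Standard LoRA scales the adapter update by alpha / rank.
        return self.lora_alpha / self.lora_rank

    @property
    def eval_fraction(self):
        # Share of the curated rows held out for evaluation.
        return self.eval_rows / (self.train_rows + self.eval_rows)


cfg = LoraRunConfig()
print(cfg.lora_scaling, cfg.eval_fraction)
```

With rank 32 and alpha 64, the scaling factor is 2, and the 180 / 20 split holds out 10% of the 200 rows.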

GGUF smoke status:

  • Repo: qqyule/objectverse-diary-qwen15b-lora
  • File: objectverse-diary-qwen15b-lora-v2-q4_k_m.gguf
  • Local helper: scripts/check_llama_cpp_smoke.py
  • Local result: passed on 2026-06-08 using llama.cpp text generation, with no fallback to mock text, schema-valid persona and diary output, and a non-empty chat reply.
  • Space result: not run; do not claim live Space text runtime until a separate Space validation passes.
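
The pass criteria listed above can be expressed as a small check. This sketch does not reproduce scripts/check_llama_cpp_smoke.py. `run_smoke` and the `generate(task)` callable are hypothetical, and "schema-valid" is simplified here to "parses as a JSON object". In a real run, `generate` would wrap the llama.cpp backend.

```python
import json


def run_smoke(generate, json_tasks=("persona", "diary")):
    """Run simplified smoke checks against a text generator.

    `generate` is any callable taking a task name and returning text.
    JSON tasks must parse to a JSON object; the chat reply must be a
    non-empty string.
    """
    results = {}
    for task in json_tasks:
        raw = generate(task)
        try:
            results[task] = isinstance(json.loads(raw), dict)
        except (TypeError, ValueError):
            results[task] = False  # not a string, or not valid JSON
    reply = generate("chat")
    results["chat"] = isinstance(reply, str) and bool(reply.strip())
    results["passed"] = all(results.values())
    return results


if __name__ == "__main__":
    # Stand-in generator for demonstration only.
    def fake(task):
        return "Hello from the mug." if task == "chat" else '{"ok": true}'

    print(run_smoke(fake))
```

Injecting the generator keeps the check independent of any particular runtime, so the same function can exercise the mock backend or a GGUF-backed one.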

Safety And Privacy

  • Do not use OpenAI, Anthropic, Gemini, Cohere, or other commercial model APIs.
  • Do not publish private user photos or unconsented personal data.
  • Do not include tokens, credit codes, emails, serial numbers, or credentials.
  • Keep raw private traces out of public datasets.
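
As one illustration of keeping credentials and emails out of published traces, a scrubber could mask obvious patterns before anything is written. This is a minimal sketch under assumptions: the regexes are simplified, the `hf_` prefix pattern only approximates Hugging Face access tokens, and serial numbers or credit codes would need their own rules. It is not a complete PII filter.

```python
import re

# Simplified patterns, for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
TOKEN_RE = re.compile(r"\bhf_[A-Za-z0-9]{20,}\b")


def scrub(text):
    """Mask email addresses and hf_-prefixed tokens in a trace string."""
    text = EMAIL_RE.sub("[email]", text)
    return TOKEN_RE.sub("[token]", text)


if __name__ == "__main__":
    fake_token = "hf_" + "x" * 24  # synthetic value, not a real token
    print(scrub(f"contact someone@example.com using {fake_token}"))
```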

Fallback Behavior

  • If VLM loading fails, fall back to the manual description and the stable example flow.
  • If llama.cpp is not installed, TEXT_MODEL_PATH is missing, model loading fails, or the output JSON is invalid, keep the deterministic mock text fallback for demo safety.
  • If model JSON is invalid, repair and validate before rendering.
  • Runtime traces do not record literal TEXT_MODEL_PATH; they only record that an external GGUF path is configured.
  • Hosted VLM validation evidence is preserved in data/traces/space-vlm/. These traces use real MiniCPM-V object understanding plus mock text generation and should not be described as full real-text-runtime traces.
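
The repair, validate, and fallback steps above, together with the trace rule about TEXT_MODEL_PATH, can be sketched as follows. Everything named here is hypothetical: the `title` and `entry` fields, the mock diary content, and the trace keys are placeholders, and the light "repair" (parsing the outermost braces) is just one possible strategy.

```python
import json

# Placeholder mock output; the real mock text and schema are not shown in this card.
MOCK_DIARY = {"title": "A quiet day", "entry": "I sat on the desk and watched."}


def extract_json_object(raw):
    """Light repair: parse the outermost {...} span, ignoring surrounding text."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(raw[start:end + 1])
    except ValueError:
        return None
    return parsed if isinstance(parsed, dict) else None


def diary_or_mock(raw, required=("title", "entry")):
    """Return (diary, source); fall back to the mock when validation fails."""
    parsed = extract_json_object(raw) if isinstance(raw, str) else None
    if parsed is not None and all(isinstance(parsed.get(k), str) for k in required):
        return parsed, "model"
    return dict(MOCK_DIARY), "mock"


def trace_record(source, text_model_path):
    # Record only whether a path is configured, never the literal path.
    return {"text_source": source, "external_gguf_configured": bool(text_model_path)}


if __name__ == "__main__":
    diary, source = diary_or_mock('Sure! {"title": "Mug", "entry": "Warm again."}')
    print(source, diary)
    print(trace_record(source, "/models/example.gguf"))
```

Returning the source label alongside the diary makes it straightforward to mark traces that used mock text, so they are not mistaken for real-text-runtime traces.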

Required Notes

  • Total model parameter count must remain <= 32B.
  • No commercial model APIs.
  • Fallback behavior must be documented.
  • Dataset provenance and privacy rules must be documented before release.