# What VLM Quality Really Feels Like at 2.8B
## Field Notes draft
Small multimodal models change the product conversation. The question is no longer just
"Can it answer?" but "Can it answer quickly and reliably enough that a blind user will trust
it with a menu, a medicine label, or a street sign?"
Third Eye uses OpenBMB's 2.8B MiniCPM-V-2 as its primary visual model. That size matters: it
keeps the project inside the Tiny Titan budget and creates a plausible path toward private,
offline inference. It also forces honest product choices.
## What worked
- High-contrast printed text is the strongest input category.
- Short, explicit prompts outperform broad requests.
- Asking the model to separate observed text from interpretation reduces ambiguity.
- A large-text on-screen transcript remains essential even when speech synthesis is available.
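As an illustration of the "observed text versus interpretation" point, here is a minimal sketch of a
two-section prompt and a parser for the reply. The section headers, the `[unreadable]` marker, and the
function names are assumptions made for this example, not the prompt Third Eye actually ships.

```python
# Hypothetical section headers; any stable, unambiguous markers would work.
OBSERVED_HEADER = "OBSERVED TEXT:"
INTERPRETATION_HEADER = "INTERPRETATION:"


def build_prompt(task: str) -> str:
    """Wrap a short, explicit task in a two-section answer format."""
    return (
        f"{task}\n"
        "Answer in two sections.\n"
        f"{OBSERVED_HEADER} copy only the text you can read in the image, "
        "and write [unreadable] for anything you cannot read.\n"
        f"{INTERPRETATION_HEADER} explain what the text means in one or two sentences."
    )


def split_response(response: str) -> tuple[str, str]:
    """Split a model reply into (observed, interpretation).

    If the reply ignores the format, everything is returned as interpretation
    so that unstructured output is never presented as verbatim text.
    """
    _, found, rest = response.partition(OBSERVED_HEADER)
    if not found:
        return "", response.strip()
    observed, _, interpretation = rest.partition(INTERPRETATION_HEADER)
    return observed.strip(), interpretation.strip()
```

Keeping the two parts separate lets the transcript show the observed text on its own, with the
interpretation clearly labeled as the model's reading of it.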
## Where 2.8B needs care
- Dense menus and curved labels can produce omissions.
- Scene descriptions should not be treated as navigation instructions.
- OCR must preserve uncertainty instead of inventing unreadable characters.
- The app needs a retry path and must never hide the original image from the user.
## Evaluation plan
The bundled menu, medicine label, and station sign form the first reproducible test set.
For each image we will record exact-text recall, hallucinated words, latency after warm start,
latency after cold start, and whether the spoken answer preserves the transcript.
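One way the two text metrics could be computed is sketched below. It assumes "exact-text recall" means
word-level recall against a hand-typed reference transcript, and that a "hallucinated word" is any
predicted word with no counterpart in the reference. Both definitions, the tokenizer, and the function
names are assumptions for illustration.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric word tokens."""
    return re.findall(r"\w+", text.lower())


def exact_text_recall(reference: str, prediction: str) -> float:
    """Fraction of reference words that also appear in the prediction.

    Counts are respected, so a word printed twice must be read twice.
    """
    ref = Counter(tokenize(reference))
    pred = Counter(tokenize(prediction))
    if not ref:
        return 1.0  # nothing to recall
    matched = sum((ref & pred).values())
    return matched / sum(ref.values())


def hallucinated_words(reference: str, prediction: str) -> list[str]:
    """Predicted words that have no counterpart in the reference."""
    ref = Counter(tokenize(reference))
    pred = Counter(tokenize(prediction))
    return sorted((pred - ref).elements())
```

For a label reading "Take 2 tablets daily" and a model answer of "Take 2 tablets twice daily with
food", recall is 1.0 but `hallucinated_words` returns `['food', 'twice', 'with']`. On a medicine
label that is the failure that matters, which is why the two numbers are recorded separately.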
The 8B MiniCPM-V-4_5 fallback is not enabled. If later testing shows that the 2.8B model is
not usable, switching must be explicit because it forfeits the Tiny Titan claim.
The primary model's documented language coverage is English and Chinese. Because of that
constraint, we dropped the planned Hindi demo from the honest MVP rather than hiding a
translation model outside the sponsor stack.
## The edge lesson
"Small enough to quantize" is not the same as "shipped on a phone." The honest roadmap is to
benchmark int4 vision first, then measure the speech models on representative hardware. Privacy
and offline availability are the reason to do that work, not a marketing shortcut.
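A benchmark of this kind needs cold-start and warm-start latency measured separately. The harness
below is a minimal sketch: `load_model` and `run` are hypothetical callables standing in for whatever
loads the quantized model and runs one inference on the target hardware, and the choice of median
over a handful of warm runs is an assumption.

```python
import time
from statistics import median
from typing import Any, Callable


def measure_latency(
    load_model: Callable[[], Any],
    run: Callable[[Any], Any],
    warm_runs: int = 5,
) -> dict[str, float]:
    """Time one cold request (load plus first inference) and several warm ones."""
    if warm_runs < 1:
        raise ValueError("warm_runs must be at least 1")

    # Cold start: includes model loading and the first inference.
    start = time.perf_counter()
    model = load_model()
    run(model)
    cold = time.perf_counter() - start

    # Warm start: the model is already loaded; take the median to damp outliers.
    warm = []
    for _ in range(warm_runs):
        t0 = time.perf_counter()
        run(model)
        warm.append(time.perf_counter() - t0)

    return {"cold_s": cold, "warm_median_s": median(warm)}
```

Running the same harness on a laptop and on a representative phone keeps the numbers comparable. A
true cold start also depends on a fresh process and on operating system caches, which this sketch
does not control.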