# What VLM Quality Really Feels Like at 2.8B

## Field Notes draft
Small multimodal models change the product conversation. The question is no longer only
"Can it answer?" It is "Can it answer quickly and reliably enough that a blind user will trust
it with a menu, a medicine label, or a street sign?"

Third Eye uses OpenBMB's 2.8B MiniCPM-V-2 as its primary visual model. That size matters: it
keeps the project inside the Tiny Titan budget and creates a plausible path toward private,
offline inference. It also forces honest product choices.
## What worked

- High-contrast printed text is the strongest input category.
- Short, explicit prompts outperform broad requests.
- Asking the model to separate observed text from interpretation reduces ambiguity.
- A large-print transcript remains essential even when speech synthesis is available.
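One way to act on the prompt observations above is to ask for two labeled sections and split the reply before displaying it. The following is a minimal sketch, not the Third Eye implementation: the prompt wording, the `TEXT:` and `INTERPRETATION:` labels, and the helper names `build_prompt` and `split_reply` are all hypothetical, and nothing here guarantees that the model will follow the requested format.

```python
from dataclasses import dataclass

# Hypothetical section labels; the model is asked, not guaranteed, to use them.
TEXT_LABEL = "TEXT:"
INTERP_LABEL = "INTERPRETATION:"


def build_prompt(task: str) -> str:
    """Return a short, explicit prompt that requests two labeled sections."""
    return (
        f"{task}\n"
        f"Reply in two parts. After '{TEXT_LABEL}' copy only the words you can read "
        f"in the image. After '{INTERP_LABEL}' explain what they mean."
    )


@dataclass
class Reply:
    observed_text: str
    interpretation: str
    well_formed: bool  # False when the reply ignored the requested format


def split_reply(raw: str) -> Reply:
    """Split a model reply into observed text and interpretation."""
    text_at = raw.find(TEXT_LABEL)
    interp_at = raw.find(INTERP_LABEL)
    if text_at == -1 or interp_at == -1 or interp_at < text_at:
        # Do not guess which words were observed; flag the reply instead.
        return Reply(observed_text="", interpretation=raw.strip(), well_formed=False)
    observed = raw[text_at + len(TEXT_LABEL):interp_at].strip()
    interpretation = raw[interp_at + len(INTERP_LABEL):].strip()
    return Reply(observed, interpretation, True)
```

Keeping a `well_formed` flag lets the interface say plainly when it could not isolate the observed text, instead of presenting interpretation as if it were a transcript.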
## Where 2.8B needs care

- Dense menus and curved labels can produce omissions.
- Scene descriptions should not be treated as navigation instructions.
- OCR must preserve uncertainty instead of inventing unreadable characters.
- The app needs a retry path and must never hide the original image from the user.
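The OCR and retry points above could be tied together with a small check. This is a sketch under stated assumptions: it presumes the prompt asked the model to write a `[?]` marker wherever it cannot read a character, which is a hypothetical convention and not documented MiniCPM-V behavior. The 10% threshold is an arbitrary placeholder that would need tuning on real images.

```python
UNREADABLE = "[?]"  # hypothetical marker requested in the prompt


def unreadable_ratio(transcript: str) -> float:
    """Fraction of whitespace-separated tokens that contain the unreadable marker."""
    tokens = transcript.split()
    if not tokens:
        return 1.0  # an empty transcript is treated as fully unreadable
    flagged = sum(1 for token in tokens if UNREADABLE in token)
    return flagged / len(tokens)


def should_offer_retry(transcript: str, threshold: float = 0.10) -> bool:
    """Suggest retaking the photo when too much of the text is marked unreadable."""
    return unreadable_ratio(transcript) > threshold
```

The markers stay in the transcript that the user sees and hears, so a retry prompt adds to the uncertain reading without replacing it.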
## Evaluation plan

The bundled menu, medicine label, and station sign form the first reproducible test set.
For each image we will record exact-text recall, hallucinated words, latency after warm start,
latency after cold start, and whether the spoken answer preserves the transcript.

The 8B MiniCPM-V-4_5 fallback is not enabled. If later testing shows that the 2.8B model is
not usable, switching must be explicit because it forfeits the Tiny Titan claim.

The primary model's documented language coverage is English and Chinese. Because of that
constraint, we dropped the planned Hindi demo from the MVP rather than hide a translation
model outside the sponsor stack.
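Two of the metrics named in the evaluation plan, exact-text recall and hallucinated words, can be sketched at the word level. This is one possible definition and not a fixed spec from the project: the function names are hypothetical, and the tokenizer keeps only lowercase ASCII letters and digits, so it would need replacing before scoring Chinese text.

```python
import re
from collections import Counter


def words(text: str) -> list[str]:
    """Lowercase the text and split it into ASCII alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(reference: str, predicted: str) -> dict:
    """Compare a model transcript against the hand-typed reference text."""
    ref = Counter(words(reference))
    pred = Counter(words(predicted))
    matched = sum((ref & pred).values())   # words present in both, with multiplicity
    total_ref = sum(ref.values())
    hallucinated = sorted((pred - ref).elements())  # predicted words absent from the reference
    return {
        "recall": matched / total_ref if total_ref else 1.0,
        "hallucinated": hallucinated,
    }
```

Counting with multiplicity means that a repeated word in the reference must also be repeated in the transcript to earn full recall. Word order is ignored, so a stricter sequence-based comparison may still be needed for items such as dosage instructions.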
## The edge lesson

"Small enough to quantize" is not the same as "shipped on a phone." The honest roadmap is to
benchmark int4 vision first, then measure the speech models on representative hardware. Privacy
and offline availability are the reason to do that work, not a marketing shortcut.
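A quick back-of-the-envelope calculation shows why int4 is the first thing to benchmark. The sketch below covers weight storage only, using the 2.8B parameter count quoted above. It ignores quantization scales, activations, the KV cache, and runtime overhead, so the real on-device footprint will be higher.

```python
def weight_bytes(params: float, bits_per_weight: int) -> float:
    """Bytes needed to store the weights alone at a given precision."""
    return params * bits_per_weight / 8


PARAMS = 2.8e9  # parameter count quoted for the primary model

for label, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    gb = weight_bytes(PARAMS, bits) / 1e9  # decimal gigabytes
    print(f"{label}: {gb:.1f} GB")
```

At 4 bits per weight the weights alone come to about 1.4 GB, against about 5.6 GB at fp16. That makes the model a plausible fit for a phone, but it says nothing about latency or accuracy after quantization, and those still have to be measured.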