third-eye / BLOG.md

mitvho09

Upload folder using huggingface_hub

031e3f9 verified 17 days ago

	# What VLM Quality Really Feels Like at 2.8B

	## Field Notes draft

	Small multimodal models change the product conversation. The question is no longer only
	"Can it answer?" It is "Can it answer quickly and reliably enough that a blind user will trust
	it with a menu, a medicine label, or a street sign?"

	Third Eye uses OpenBMB's 2.8B MiniCPM-V-2 as its primary visual model. That size matters: it
	keeps the project inside the Tiny Titan budget and creates a plausible path toward private,
	offline inference. It also forces honest product choices.

	## What worked

	- High-contrast printed text is the strongest input category.
	- Short, explicit prompts outperform broad requests.
	- Asking the model to separate observed text from interpretation reduces ambiguity.
	- A large-type transcript remains essential even when speech synthesis is available.
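
One way to apply the "separate observed text from interpretation" point is to ask for two labeled sections and split the reply before anything is spoken or shown. The sketch below is illustrative only: the prompt wording, the `OBSERVED:` and `INTERPRETATION:` labels, and the `split_answer` helper are assumptions, not the prompt Third Eye actually ships.

```python
# Hypothetical prompt and parser. The section labels are an assumption,
# not the format the app is known to use.
PROMPT = (
    "Read the text in this image.\n"
    "Reply in exactly two sections:\n"
    "OBSERVED: the text exactly as printed, line by line.\n"
    "INTERPRETATION: one short sentence on what it means."
)


def split_answer(answer: str) -> dict[str, str]:
    """Split a model reply into observed text and interpretation."""
    sections = {"observed": "", "interpretation": ""}
    current = None
    for line in answer.splitlines():
        stripped = line.strip()
        upper = stripped.upper()
        if upper.startswith("OBSERVED:"):
            current = "observed"
            stripped = stripped[len("OBSERVED:"):].strip()
        elif upper.startswith("INTERPRETATION:"):
            current = "interpretation"
            stripped = stripped[len("INTERPRETATION:"):].strip()
        # Lines before the first label are ignored rather than guessed at.
        if current and stripped:
            sections[current] = (sections[current] + "\n" + stripped).strip()
    return sections
```

If the model ignores the format, both sections come back empty, which is a useful signal to fall back to showing the raw reply rather than presenting it as verified text.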

	## Where 2.8B needs care

	- Dense menus and curved labels can produce omissions.
	- Scene descriptions should not be treated as navigation instructions.
	- OCR must preserve uncertainty instead of inventing unreadable characters.
	- The app needs a retry path and must never hide the original image from the user.
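
The last two points can be combined into a small wrapper: keep the image path with every result, count how often the model flagged something as unreadable, and offer a retake when a reading stays unclear. This is a minimal sketch under stated assumptions: `describe` stands in for whatever function calls the model, the `[?]` marker is a convention the prompt would have to request, and the thresholds are placeholders.

```python
from dataclasses import dataclass
from typing import Callable

UNREADABLE = "[?]"  # assumed marker the prompt asks the model to emit


@dataclass
class Reading:
    image_path: str    # always kept so the UI can show the original image
    text: str
    attempts: int
    needs_retry: bool  # True when the user should be offered a retake


def read_with_retry(
    describe: Callable[[str], str],  # hypothetical model call
    image_path: str,
    max_attempts: int = 2,
    max_unreadable: int = 3,
) -> Reading:
    """Call the model up to max_attempts times, keeping the clearest reading."""
    best_text, best_count = "", None
    for attempt in range(1, max_attempts + 1):
        text = describe(image_path).strip()
        count = text.count(UNREADABLE)
        if text and (best_count is None or count < best_count):
            best_text, best_count = text, count
        if text and count <= max_unreadable:
            return Reading(image_path, text, attempt, needs_retry=False)
    # No attempt was clear enough: return the best one and flag it.
    return Reading(image_path, best_text, max_attempts, needs_retry=True)
```

The wrapper never fills in the unreadable spans itself. It only decides whether to ask again, so uncertainty reaches the user intact.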

	## Evaluation plan

	The bundled menu, medicine label, and station sign form the first reproducible test set.
	For each image we will record exact-text recall, the number of hallucinated words, latency
	after a cold start and after a warm start, and whether the spoken answer matches the transcript.
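
As a rough sketch of how those measurements could be computed, the code below treats recall and hallucination as word-level multiset comparisons against a hand-typed reference. That reading of "exact-text recall", the `\w+` tokenizer (which suits English but not Chinese), and the helper names are all assumptions, not the project's actual harness.

```python
import re
import time
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into word tokens (an assumed normalization)."""
    return re.findall(r"\w+", text.lower())


def score(reference: str, predicted: str) -> dict[str, float]:
    """Word-level recall against the reference, plus extra words emitted."""
    ref = Counter(tokenize(reference))
    pred = Counter(tokenize(predicted))
    matched = sum((ref & pred).values())  # words found, respecting counts
    total_ref = sum(ref.values())
    return {
        "recall": matched / total_ref if total_ref else 1.0,
        "hallucinated_words": sum((pred - ref).values()),
    }


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

Timing the first call after the model loads and a later call on the same image with `timed` gives the cold-start and warm-start numbers separately.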

	The 8B MiniCPM-V-4_5 fallback is not enabled. If later testing shows that the 2.8B model is
	not usable, switching must be explicit because it forfeits the Tiny Titan claim.

	The primary model's documented language coverage is English and Chinese. Because of that
	constraint, we removed the planned Hindi demo from the MVP rather than hide a translation
	model outside the sponsor stack.

	## The edge lesson

	"Small enough to quantize" is not the same as "shipped on a phone." The honest roadmap is to
	benchmark int4 vision first, then measure the speech models on representative hardware. Privacy
	and offline availability are the reason to do that work, not a marketing shortcut.
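
A quick back-of-the-envelope calculation shows why "small enough to quantize" is only the starting point. The sketch below estimates weight storage alone for a 2.8B-parameter model. It assumes every parameter is stored at the same precision and ignores quantization scales, activations, the KV cache, and runtime overhead, so real on-device memory use will be higher.

```python
def weight_bytes(params: float, bits_per_weight: float) -> float:
    """Storage for the weights alone, in bytes."""
    return params * bits_per_weight / 8


PARAMS = 2.8e9  # parameter count stated for the primary model

for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name}: {weight_bytes(PARAMS, bits) / 1e9:.1f} GB")
# fp16: 5.6 GB
# int8: 2.8 GB
# int4: 1.4 GB
```

Roughly 1.4 GB of int4 weights is plausible on a recent phone, but that figure says nothing about latency or accuracy after quantization, which is why the benchmark comes first.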