OceanirAI
/

Oculus

Image-Text-to-Text

vision-language

image-captioning

object-detection

Model card Files Files and versions

Oculus / docs /TRAINING_ROADMAP.md

kobiakor15's picture

Upload docs/TRAINING_ROADMAP.md with huggingface_hub

4145f82 verified 7 days ago

|

history blame contribute delete

877 Bytes

	# 🚀 Oculus V3: Future Training Roadmap

	COCO (Current) = 80 common classes. Good baseline, but limited for real-world niche tasks.

	## Option A: Universal Detection (The "Scanner")
	Target: Detect 1200+ specific objects.
	- Dataset: LVIS or Objects365.
	- Result: Recognizes "stapler", "doorknob", "mango" instead of just generic classes.

	## Option B: Visual Reasoning (The "Thinker")
	Target: Better VQA and complex instruction following.
	- Dataset: LLaVA-Instruct or VizWiz.
	- Why: Teaches the model to "explain why the car is parked" rather than just finding the car.
	- Result: A smarter chatbot-like VLM.

	## Recommendation
	Since Oceanir is a VLM platform, Option B (Instruction Tuning) is the highest value next step. It improves the model's IO (Intelligence Output) significantly more than just adding more bounding boxes.