	# Saken OmniFF — Architecture Whitepaper

	Full name: Saken OmniFF
	Runtime name: OmniFF Runtime
	Kazakh-first variant: OmniFF-KZ
	Core formula: FFmpeg for AI inference, generation, and multimodal transformation

	Document status: architectural doctrine
	Purpose: single canonical document for design, implementation, publication, and explanation

	---

	## Naming Policy

	| Surface | Identifier |
	|---------|------------|
	| CLI binary | `omniff` |
	| Python package | `saken-omniff` |
	| Rust crates | `omniff-core`, `omniff-graph`, `omniff-runtime`, `omniff-cli` |
	| NPM package | `@saken/omniff` |
	| GitHub | `stukenov/omniff` |
	| Hugging Face | `stukenov/omniff-runtime` |

	All public APIs, imports, configs, and docs must use these canonical names. No aliases.

	---

	## 1. Summary

	OmniFF Runtime is not a neural network model. It is a universal multimodal runtime that accepts any input (text, audio, image, video, documents, structured data) and transforms it into any output modality through a managed graph of models, filters, validators, and planners.

	Architecture inspired by FFmpeg:

	```
	container → demuxer → decoder → filtergraph → encoder → muxer
	```

	For AI this becomes:

	```
	input container
	→ demuxer
	→ modality decoder
	→ OmniFrame normalization
	→ Thinking+ planner
	→ AI filtergraph
	→ model experts
	→ validators
	→ output encoder
	→ muxer
	```

	Key distinction from ordinary LLM systems: the model is not the center of the product. The model is one computational node. The center is the runtime, the graph, and the unified format for processing multimodal streams.

	### Full modality matrix

	```
	text → text, image, video, audio
	image → text, image, video
	video → text, image, video
	audio → text, audio, video
	document → text, document
	mixed input → mixed output
	```
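
The matrix can be read as a lookup table. The sketch below is illustrative only: `MODALITY_MATRIX` and `is_supported` are hypothetical names, not part of the OmniFF API, and the mixed-input row is left out because the document does not define how mixed requests decompose.

```python
from typing import Dict, FrozenSet

# Direct conversions transcribed from the modality matrix above.
# The "mixed input -> mixed output" row is intentionally omitted.
MODALITY_MATRIX: Dict[str, FrozenSet[str]] = {
    "text": frozenset({"text", "image", "video", "audio"}),
    "image": frozenset({"text", "image", "video"}),
    "video": frozenset({"text", "image", "video"}),
    "audio": frozenset({"text", "audio", "video"}),
    "document": frozenset({"text", "document"}),
}


def is_supported(source: str, target: str) -> bool:
    """Return True if the matrix lists a direct source -> target conversion."""
    return target in MODALITY_MATRIX.get(source, frozenset())
```

A check like this could run before planning, so that a request for an unlisted conversion is rejected early instead of failing deep inside a graph.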

	The system does not pretend to be one monolithic model. It is a unified runtime with many specialized models inside.

	---

	## 2. Core Doctrine

	### 2.1. This is a runtime, not a model

	OmniFF Runtime is not one Transformer, not one `.safetensors`, not one decoder-only LLM.

	Correct definitions:

	```
	Omni inference runtime
	Multimodal graph engine
	AI media processing framework
	Routed multimodal model system
	FFmpeg-like AI runtime
	```

	A model is only one type of node inside the runtime.

	### 2.2. FFmpeg Principle

	FFmpeg became foundational not because it was one codec. It became foundational because it gave a common language for containers, streams, codecs, filters, transformations, and output.

	OmniFF Runtime does the same for AI:

	```
	media streams → AI streams
	frames → OmniFrames
	filters → AI filters
	codecs → model experts
	filtergraph → OmniGraph
	metadata → prompt/control side data
	muxing → multimodal output assembly
	```

	### 2.3. Models are codecs

	```
	LLM = text codec / reasoning codec
	Whisper = audio perception codec
	VLM = vision perception codec
	Image diffusion = image generation codec
	Video diffusion = video generation codec
	TTS = speech generation codec
	Encoder router = routing codec
	OCR = document perception codec
	```

	A model does not control the system. A model executes its role in the graph.

	---

	## 3. Architectural Laws

	### Law 1. Runtime over model

	The product must not be hostage to one model, one provider, one tokenizer, one inference engine, or one weight format. Models can be swapped. The runtime must remain.

	### Law 2. Graph over pipeline

	The system must not be a collection of hardcoded pipeline scripts. All transformations must be expressed through a graph.

	Wrong:

	```
	if image then do this
	if video then do that
	if audio then do another script
	```

	Right:

	```
	input → graph planner → DAG → execution → validation → output
	```

	### Law 3. Prompt is a control layer, not just a string

	Prompt must be represented as structured side data:

	- user prompt
	- system prompt
	- task prompt
	- modality prompt
	- generation prompt
	- negative prompt
	- control prompt
	- validator prompt
	- constraints, seed, strength, masks
	- reference assets, preservation rules
	- thinking budget

	### Law 4. Thinking+ is a planner, not a final answer

	Thinking+ is a control module that builds an execution plan, selects a graph, assigns models, sets constraints, launches validators, and decides on retry. The user receives a brief execution summary, not the internal chain of reasoning.

	### Law 5. Router must be cheap

	Router must not be a large LLM. Router must be an encoder-only model or another cheap classifier.

	```
	prompt / normalized semantic state
	→ encoder-only classifier
	→ selected model / route / graph
	```

	Router does not generate answers. Router selects routes.

	### Law 6. Do not run all models always

	Wrong:

	```
	0.6B → 4B → 14B → 32B always
	```

	Right:

	```
	router → minimum sufficient model
	```

	Only on failure:

	```
	fallback / escalation / validation retry
	```

	### Law 7. Omni-directions require separate generative branches

	An LLM or VLM alone cannot cover image-to-image and video-to-video. Separate generative branches are needed:

	- image generation, editing, inpainting, ControlNet-like control
	- video generation, video-to-video, temporal consistency
	- audio generation, TTS
	- document rendering

	### Law 8. One external product, many internal experts

	Outside: one API, one CLI, one SDK, one HF repo, one Docker image.

	Inside, the system is modular:

	```
	router, ASR, VLM, LLM, image generator, video generator,
	TTS, OCR, document parser, validators, scheduler
	```

	### Law 9. Architectural honesty over marketing

	The project must not pretend to be one monolithic neural network. The honest description is:

	```
	An FFmpeg-like multimodal AI runtime with routed model experts
	and thinking-controlled graph execution.
	```

	---

	## 4. Core Architecture

	### 4.1. Top-level flow

	```
	User input
	↓
	Input container
	↓
	Demuxer
	↓
	Modality decoder
	↓
	OmniPacket / OmniFrame
	↓
	Normalization
	↓
	Thinking+ Planner
	↓
	Router
	↓
	OmniGraph
	↓
	Model/filter execution
	↓
	Validators
	↓
	Output encoder
	↓
	Muxer
	↓
	Final output
	```

	### 4.2. Core libraries

	By analogy with FFmpeg:

	| Library | Responsibility |
	|---------|----------------|
	| `libomniformat` | input/output containers, demux/mux |
	| `libomnimodel` | model loading and execution |
	| `libomnifilter` | AI filters and transformations |
	| `libomnigraph` | DAG planning and execution |
	| `libomnimemory` | tensors, frames, cache, device placement |
	| `libomnivalidate` | validators, critics, constraint checks |
	| `libomnischedule` | scheduling, batching, GPU/CPU placement |
	| `libomniapi` | CLI, SDK, HTTP API |

	### 4.3. Runtime entities

	#### OmniPacket

	Raw input fragment:

	```
	text chunk, audio bytes, video packet, image bytes,
	PDF page, JSON message, subtitle segment, metadata block
	```

	#### OmniFrame

	Normalized processing object:

	```
	text tokens, audio PCM, image tensor, video frame,
	embedding, mask, depth map, pose map, transcript,
	OCR layer, scene graph, semantic state, control map
	```

	#### OmniGraph

	DAG describing task execution:

	```
	nodes = models / filters / tools / validators
	edges = data dependencies
	side data = prompts / controls / constraints
	```

	#### OmniNode

	One executable node:

	```
	ASR node, VLM node, LLM node, image generation node,
	video generation node, OCR node, validator node,
	ffmpeg utility node, scheduler node
	```

	#### OmniModel

	Model wrapper:

	```
	load, unload, infer, generate, stream,
	batch, quantize, cache
	```

	#### OmniFilter

	Data transformation:

	```
	resize, crop, normalize, extract_depth, extract_edges,
	extract_pose, detect_faces, track_objects, split_shots,
	summarize, translate, style_transfer
	```

	#### OmniValidator

	Result verification:

	```
	language check, schema check, visual prompt adherence,
	face preservation, temporal consistency, OCR correctness,
	toxicity/safety check, factuality check, format validation
	```

	---

	## 5. Routing

	### 5.1. Router role

	Router does not answer the user. Router selects:

	- which graph template to use
	- which models to invoke
	- which thinking level to enable
	- which validator is needed
	- which escalation policy applies

	### 5.2. Encoder-only router

	Preferred architecture:

	```
	XLM-R / ModernBERT / BGE-style encoder
	+ classification head
	→ route class
	```

	Output:

	```json
	{
	  "selected_route": "image_to_image",
	  "confidence": 0.87,
	  "risk": "low",
	  "thinking": "normal"
	}
	```

	### 5.3. Route classes

	```
	TEXT_SIMPLE
	TEXT_NORMAL
	TEXT_COMPLEX
	AUDIO_TRANSCRIBE_ONLY
	AUDIO_QA
	IMAGE_CAPTION
	IMAGE_EDIT
	TEXT_TO_IMAGE
	TEXT_TO_VIDEO
	IMAGE_TO_VIDEO
	VIDEO_SUMMARY
	VIDEO_TO_VIDEO
	DOCUMENT_OCR_QA
	DOCUMENT_TO_DOCUMENT
	REJECT_OR_HUMAN_REVIEW
	```
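
One way to turn classifier scores into a route is sketched below. It assumes the classification head yields one score per route class. `RouteDecision`, `decide`, and the `min_confidence` threshold are hypothetical, and falling back to `REJECT_OR_HUMAN_REVIEW` on low confidence is an assumption, not documented behavior.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict

# Route classes copied from the list above.
Route = Enum(
    "Route",
    [
        "TEXT_SIMPLE", "TEXT_NORMAL", "TEXT_COMPLEX",
        "AUDIO_TRANSCRIBE_ONLY", "AUDIO_QA",
        "IMAGE_CAPTION", "IMAGE_EDIT",
        "TEXT_TO_IMAGE", "TEXT_TO_VIDEO", "IMAGE_TO_VIDEO",
        "VIDEO_SUMMARY", "VIDEO_TO_VIDEO",
        "DOCUMENT_OCR_QA", "DOCUMENT_TO_DOCUMENT",
        "REJECT_OR_HUMAN_REVIEW",
    ],
)


@dataclass(frozen=True)
class RouteDecision:
    route: Route
    confidence: float


def decide(scores: Dict[str, float], min_confidence: float = 0.6) -> RouteDecision:
    """Pick the highest-scoring route; fall back to review when unsure.

    `scores` stands in for the classifier head's per-class probabilities.
    `min_confidence` is a hypothetical threshold, not a documented setting.
    """
    name, confidence = max(scores.items(), key=lambda item: item[1])
    if confidence < min_confidence:
        return RouteDecision(Route.REJECT_OR_HUMAN_REVIEW, confidence)
    # Route[name] raises KeyError for a class name outside the list above.
    return RouteDecision(Route[name], confidence)
```

The router stays cheap under this design: the only work after the encoder forward pass is an argmax and a threshold comparison.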

	### 5.4. Model ladder (text/reasoning)

	If using Qwen family as text/reasoning backbone:

	```
	Qwen3-0.6B router-assistant / cheap tasks
	Qwen3-4B normal assistant
	Qwen3-14B hard tasks
	Qwen3-32B local high-quality / judge
	```

	Production minimum:

	```
	Qwen3-0.6B (cheap/fast)
	Qwen3-4B (normal)
	Qwen3-14B (complex)
	Qwen3-32B (judge)
	```

	Router selects the minimum sufficient model. Escalation only on failure.

	---

	## 6. Thinking+

	### 6.1. Purpose

	Thinking+ is a control layer for planning and execution control, not just "the model thinks longer."

	Thinking+ must:

	1. Understand the task
	2. Determine input/output modalities
	3. Select graph template
	4. Choose models
	5. Assign validators
	6. Set constraints
	7. Define retry policy
	8. Form execution plan

	### 6.2. Thinking levels

	```
	thinking=off
	  fast routing, single pass, minimal checking

	thinking=fast
	  router + simple graph

	thinking=normal
	  planner + executor + validator

	thinking=deep
	  planner + executor + critic + retry

	thinking=research
	  multiple candidates + judge + detailed validation
	```
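
The levels above can be held as plain data so that the planner looks up which stages to enable. This is a minimal sketch: the stage strings are copied from the list, while `THINKING_LEVELS`, `stages_for`, and `allows_retry` are hypothetical names.

```python
from typing import Dict, List

# Stages per thinking level, transcribed from the list above.
THINKING_LEVELS: Dict[str, List[str]] = {
    "off": ["fast routing", "single pass", "minimal checking"],
    "fast": ["router", "simple graph"],
    "normal": ["planner", "executor", "validator"],
    "deep": ["planner", "executor", "critic", "retry"],
    "research": ["multiple candidates", "judge", "detailed validation"],
}


def stages_for(level: str) -> List[str]:
    """Return the stages enabled at a thinking level, or raise on unknown levels."""
    if level not in THINKING_LEVELS:
        raise ValueError(f"unknown thinking level: {level!r}")
    return THINKING_LEVELS[level]


def allows_retry(level: str) -> bool:
    """True only for levels whose stage list explicitly includes a retry step."""
    return "retry" in stages_for(level)
```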

	### 6.3. Execution plan example

	```json
	{
	  "task": "video_to_video",
	  "preserve": ["faces", "voice", "camera_structure"],
	  "style": "premium minimal corporate",
	  "required_nodes": [
	    "shot_detection",
	    "face_tracking",
	    "audio_transcription",
	    "style_transfer_video",
	    "temporal_validator",
	    "audio_mux"
	  ],
	  "risk": "high",
	  "validator": "vlm_video_judge",
	  "retry_policy": "up_to_2"
	}
	```

	### 6.4. User-facing output

	The internal reasoning chain is never a required output. The user receives a brief route explanation:

	```json
	{
	  "mode": "deep",
	  "route": "image_to_image",
	  "controls": ["depth", "mask", "reference"],
	  "generator": "image_edit_model",
	  "validator": "vision_validator"
	}
	```

	---

	## 7. Prompt Control

	### 7.1. Prompt as side data

	Every OmniFrame carries side data:

	```json
	{
	  "prompt": "matte graphite car wrap",
	  "negative_prompt": "cartoon, distorted wheels, wrong car shape",
	  "seed": 42,
	  "strength": 0.35,
	  "preserve_identity": true,
	  "preserve_layout": true,
	  "control_maps": ["depth", "canny", "mask"],
	  "thinking_budget": 2048,
	  "validator_threshold": 0.82
	}
	```
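
A typed container makes the side data harder to misuse than a raw dictionary. The sketch below mirrors the field names in the JSON example. The class layout, the default values, and the assumption that `strength` and `validator_threshold` are fractions in [0, 1] are all illustrative, not taken from the OmniFF codebase.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SideData:
    """Prompt-control side data; field names follow the JSON example above."""

    prompt: str
    negative_prompt: str = ""
    seed: Optional[int] = None
    strength: float = 0.5              # assumed default
    preserve_identity: bool = False    # assumed default
    preserve_layout: bool = False      # assumed default
    control_maps: List[str] = field(default_factory=list)
    thinking_budget: int = 0           # assumed default
    validator_threshold: float = 0.0   # assumed default

    def __post_init__(self) -> None:
        # Assumption: both values are fractions in [0, 1].
        for name in ("strength", "validator_threshold"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")

    @classmethod
    def from_json(cls, text: str) -> "SideData":
        # Unknown keys raise TypeError, which surfaces typos early.
        return cls(**json.loads(text))


@dataclass
class OmniFrame:
    """Hypothetical frame layout: a payload plus the side data that travels with it."""

    modality: str
    payload: bytes
    side_data: SideData
```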

	### 7.2. Prompt layers

	```
	system prompt → global behavior
	user prompt → user intent
	task prompt → task-specific instructions
	modality prompt → per-modality hints
	generation prompt → enriched generation instruction
	negative prompt → what to avoid
	control prompt → structural control
	validator prompt → validation criteria
	```

	### 7.3. Image prompt control

	```json
	{
	  "user_prompt": "Make the car matte graphite",
	  "generation_prompt": "black Hyundai Elantra 2021, matte graphite wrap, premium realistic studio lighting",
	  "negative_prompt": "cartoon, damaged car, wrong wheels, deformed body",
	  "preserve": {
	    "car_model": true,
	    "camera_angle": true,
	    "body_shape": true,
	    "background": false
	  },
	  "strength": 0.38,
	  "controls": ["canny", "depth"]
	}
	```

	### 7.4. Video prompt control

	```json
	{
	  "global_prompt": "cinematic corporate video, clean premium style",
	  "shot_prompts": [
	    {"shot": 1, "prompt": "slow dolly-in, preserve subject identity"},
	    {"shot": 2, "prompt": "soft lighting, premium office mood"}
	  ],
	  "negative_prompt": "flickering, face distortion, unstable hands, warped text",
	  "temporal_consistency": "high",
	  "style_strength": 0.45
	}
	```

	---

	## 8. Omni-Directions

	### 8.1. Text → Text

	Purpose: answers, analysis, translation, correction, classification, legal, code, RAG, structured output.

	```
	text input → language detection → router → LLM expert → validator → text output
	```

	### 8.2. Audio → Text

	Purpose: transcription, speech translation, meeting summary, call center, lectures.

	```
	audio → VAD/chunking → ASR → transcript cleanup → language correction
	→ LLM/summary/QA → text output
	```

	### 8.3. Audio → Audio

	Purpose: speech-to-speech assistant, dubbing, voice translation, call center automation.

	```
	audio → ASR → LLM → TTS → audio encoder
	```

	### 8.4. Audio → Video

	Purpose: podcast visualization, music video generation, audio-driven animation.

	```
	audio → ASR/analysis → scene planner → video generation → audio mux → output
	```

	### 8.5. Image → Text

	Purpose: captioning, OCR, visual QA, document analysis, screenshot understanding.

	```
	image → image decoder → VLM/OCR → normalized text/scene graph → router → LLM → text output
	```

	### 8.6. Text → Image

	Purpose: image generation, visual concepts, design, ads, UI mockups.

	```
	text prompt → prompt planner → image generation model → image validator → image encoder
	```

	### 8.7. Image → Image

	Purpose: image editing, stylization, inpainting, outpainting, color change, shape preservation, reference-based generation.

	```
	image + prompt → image analysis → mask/control extraction → edit planner
	→ image edit model → vision validator → output image
	```

	Controls: mask, depth, canny, pose, segmentation, reference image, style strength, identity preservation, layout preservation, negative prompt, seed.

	### 8.8. Text → Video

	Purpose: clip generation, ads, storyboard-to-video, presentation videos.

	```
	text prompt → scene planner → shot list → video generation model
	→ temporal validator → video encoder
	```

	### 8.9. Image → Video

	Purpose: image animation, motion prompt, avatar video, product animation.

	```
	image + motion prompt → image analysis → motion planner
	→ image-to-video model → temporal validator → video output
	```

	### 8.10. Video → Text

	Purpose: video summary, lecture analysis, surveillance analysis, meeting extraction, content indexing.

	```
	video → demux audio/video → shot detection → keyframe extraction → ASR
	→ VLM analysis → multimodal summary → text output
	```

	### 8.11. Video → Image

	Purpose: keyframe extraction, thumbnail generation, scene capture.

	```
	video → shot detection → keyframe selection → VLM analysis → best frame selection
	→ optional image enhancement → image output
	```

	### 8.12. Video → Video

	Purpose: style transfer, enhancement, cinematic transformation, face/body/background preservation, corporate video transformation, generative editing.

	```
	video → demux → shot detection → keyframe extraction → audio transcription
	→ motion analysis → depth/pose/edge maps → video edit planner
	→ video generation/editing model → temporal consistency filter
	→ audio restoration/mux → video output
	```

	Controls: global prompt, per-shot prompt, negative prompt, motion strength, style strength, identity preservation, camera preservation, seed per shot, control maps per frame, mask tracks, temporal consistency.

	### 8.13. Document → Text

	Purpose: PDF analysis, contract review, law analysis, table extraction, OCR, document QA.

	```
	document → parser/OCR → layout extraction → chunks → retrieval/reasoning → text output
	```

	### 8.14. Document → Document

	Purpose: contracts, whitepapers, PRDs, technical specs, government letters, Word/PDF/slides generation.

	```
	document/input text → structure planner → content generator
	→ format renderer → validator → output document
	```

	---

	## 9. Image-to-Image Architecture

	### 9.1. Why image-to-image is not LLM-only

	An LLM can understand an editing instruction, but it must not be the sole image generator.

	```
	LLM/VLM = understand and plan
	Image model = generate/edit
	Validator = check
	```

	### 9.2. Image-to-image nodes

	```
	decode_image
	analyze_image_with_vlm
	extract_mask
	extract_depth
	extract_edges
	extract_pose
	plan_edit_with_thinking
	run_image_edit_model
	validate_prompt_adherence
	validate_preservation
	encode_image
	```

	### 9.3. Example graph

	```json
	{
	  "nodes": [
	    {"id": "analyze_image", "model": "vlm"},
	    {"id": "extract_depth", "model": "depth"},
	    {"id": "extract_mask", "model": "sam"},
	    {"id": "plan_edit", "model": "llm_thinking"},
	    {"id": "generate_image", "model": "image_edit"},
	    {"id": "validate", "model": "vision_validator"}
	  ],
	  "edges": [
	    ["analyze_image", "plan_edit"],
	    ["extract_depth", "generate_image"],
	    ["extract_mask", "generate_image"],
	    ["plan_edit", "generate_image"],
	    ["generate_image", "validate"]
	  ]
	}
	```
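
To show how a graph like this could be scheduled, the sketch below groups nodes into batches whose dependencies are already satisfied, using Python's standard `graphlib` module. `execution_batches` is a hypothetical helper and not the OmniGraph executor. It assumes each edge is an `[upstream, downstream]` pair, as the example suggests.

```python
from graphlib import TopologicalSorter
from typing import Any, Dict, List

EXAMPLE_GRAPH: Dict[str, Any] = {
    "nodes": [
        {"id": "analyze_image", "model": "vlm"},
        {"id": "extract_depth", "model": "depth"},
        {"id": "extract_mask", "model": "sam"},
        {"id": "plan_edit", "model": "llm_thinking"},
        {"id": "generate_image", "model": "image_edit"},
        {"id": "validate", "model": "vision_validator"},
    ],
    "edges": [
        ["analyze_image", "plan_edit"],
        ["extract_depth", "generate_image"],
        ["extract_mask", "generate_image"],
        ["plan_edit", "generate_image"],
        ["generate_image", "validate"],
    ],
}


def execution_batches(graph: Dict[str, Any]) -> List[List[str]]:
    """Group node ids into batches; nodes within a batch have no mutual dependencies."""
    node_ids = {node["id"] for node in graph["nodes"]}
    sorter: TopologicalSorter = TopologicalSorter()
    for node_id in node_ids:
        sorter.add(node_id)
    for upstream, downstream in graph["edges"]:
        if upstream not in node_ids or downstream not in node_ids:
            raise ValueError(f"edge references unknown node: {upstream} -> {downstream}")
        sorter.add(downstream, upstream)  # downstream depends on upstream

    sorter.prepare()  # raises graphlib.CycleError if the graph is not a DAG
    batches: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches
```

For the example graph, the three analysis nodes form the first batch and could run in parallel, followed by `plan_edit`, `generate_image`, and `validate` in turn.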

	---

	## 10. Video-to-Video Architecture

	### 10.1. Why video-to-video is harder than image-to-image

	Processing each frame independently causes:

	- flickering
	- identity loss
	- face distortion
	- motion destruction
	- unstable style
	- inter-frame artifacts

	Video-to-video requires temporal consistency.

	### 10.2. Video-to-video nodes

	```
	demux_video_audio
	decode_video_frames
	shot_detection
	keyframe_selection
	transcribe_audio
	analyze_keyframes
	extract_depth_sequence
	extract_pose_sequence
	extract_edges_sequence
	track_faces
	track_objects
	plan_shots_with_thinking
	run_video_edit_model
	temporal_consistency_filter
	restore_audio
	encode_video
	mux_audio_video
	validate_video
	```

	### 10.3. Example graph

	```json
	{
	  "nodes": [
	    {"id": "split_video", "tool": "ffmpeg"},
	    {"id": "analyze_keyframes", "model": "vlm"},
	    {"id": "transcribe_audio", "model": "asr"},
	    {"id": "extract_motion", "tool": "optical_flow"},
	    {"id": "make_control_maps", "models": ["depth", "canny", "pose"]},
	    {"id": "plan_shots", "model": "llm_thinking"},
	    {"id": "generate_video", "model": "video_diffusion"},
	    {"id": "restore_audio", "tool": "ffmpeg"},
	    {"id": "validate_video", "model": "video_validator"}
	  ]
	}
	```

	---

	## 11. Runtime Scheduling

	### 11.1. Scheduler as critical layer

	Without a scheduler the system becomes a slow Python script. Scheduler must manage:

	- CPU/GPU placement
	- model loading and unloading
	- batch processing and streaming
	- caching and retry
	- memory pressure and prioritization
	- long-running jobs and device affinity

	### 11.2. Example device distribution

	```
	CPU: demux, decode, OCR preprocessing, ffmpeg ops, graph planning
	GPU 0: ASR / Whisper
	GPU 1: VLM / image analysis
	GPU 2: LLM / planner / router
	GPU 3: image/video generation
	```

	### 11.3. Model loading policy

	Not all models should be in memory at all times.

	```
	hot models: router, small LLM, ASR small
	warm models: VLM, medium LLM, image edit
	cold models: video generation, huge judge, rare experts
	```

	Scheduler capabilities:

	```
	preload, lazy load, unload, pin to GPU, move to CPU,
	quantized load, batch requests, reuse cache
	```
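
A loading policy of this kind can be approximated with a pinned set plus a least-recently-used cache. The sketch below is a simplification under stated assumptions: `ModelPool` is a hypothetical class, capacity is counted in models rather than bytes of device memory, and `loader` stands in for whatever actually loads weights.

```python
from collections import OrderedDict
from typing import Any, Callable, Dict, Iterable, List


class ModelPool:
    """Keeps hot models pinned and evicts the least recently used of the rest."""

    def __init__(self, loader: Callable[[str], Any], pinned: Iterable[str], capacity: int) -> None:
        self._loader = loader
        self._capacity = capacity  # max number of non-pinned models kept resident
        # Hot models are preloaded once and never evicted.
        self._hot: Dict[str, Any] = {name: loader(name) for name in pinned}
        self._lru: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, name: str) -> Any:
        if name in self._hot:
            return self._hot[name]
        if name in self._lru:
            self._lru.move_to_end(name)  # mark as most recently used
            return self._lru[name]
        model = self._loader(name)  # lazy load on first use
        self._lru[name] = model
        while len(self._lru) > self._capacity:
            self._lru.popitem(last=False)  # unload the least recently used model
        return model

    def resident(self) -> List[str]:
        return list(self._hot) + list(self._lru)
```

A production scheduler would more likely budget by memory and device, but the same split applies: pinned entries model the hot tier, and eviction order separates warm from cold.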

	---

	## 12. CLI

	### 12.1. FFmpeg-like CLI

	```bash
	omniff -i input.jpg \
	  -prompt "make it matte graphite, preserve body and angle" \
	  -of image \
	  -o result.png
	```

	```bash
	omniff -i input.mp4 \
	  -prompt "premium corporate ad style" \
	  -thinking deep \
	  -preserve faces,voice,structure \
	  -strength 0.42 \
	  -o output.mp4
	```

	```bash
	omniff -i lesson_audio.mp3 \
	  -task summarize \
	  -lang kk \
	  -model auto \
	  -o output.md
	```

	```bash
	omniff -i contract.pdf \
	  -task "find risks and write brief summary" \
	  -thinking deep \
	  -o review.docx
	```

	### 12.2. Explicit graph CLI

	```bash
	omniff -i input.mp4 \
	  -graph graphs/video_to_video_premium.yaml \
	  -prompt "premium Apple-like corporate style" \
	  -o output.mp4
	```

	---

	## 13. SDK / API

	### 13.1. Python API

	```python
	from omniff import OmniFFRuntime

	runtime = OmniFFRuntime.from_pretrained("stukenov/omniff-runtime")

	result = runtime.run(
	    input="input.mp4",
	    prompt="Video-to-video in premium style, preserve faces",
	    output_modality="video",
	    thinking="deep",
	    controls={
	        "preserve_identity": True,
	        "preserve_audio": True,
	        "style_strength": 0.45,
	        "temporal_consistency": "high",
	    },
	)

	result.save("output.mp4")
	```

	### 13.2. Planning API

	```python
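	# Reuses the `runtime` object created in 13.1.
	# Plan first, inspect the graph, then execute it as a separate step.
	graph = runtime.plan(
	    input="car.jpg",
	    prompt="make it matte graphite",
	    output_modality="image",
	)

	print(graph)
	result = runtime.execute(graph)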
	```

	### 13.3. HTTP API

	```
	POST /v1/omniff/run
	```

	```json
	{
	  "input": "file://input.mp4",
	  "prompt": "premium style video",
	  "output_modality": "video",
	  "thinking": "deep",
	  "controls": {
	    "preserve_faces": true,
	    "preserve_voice": true,
	    "style_strength": 0.45
	  }
	}
	```

	---

	## 14. Packaging

	### 14.1. Not one safetensors

	Production packaging:

	```
	omniff-runtime/
	  omniff.yaml
	  graph_templates/
	  models/
	    router/
	    asr/
	    vlm/
	    llm_small/
	    llm_large/
	    image_generator/
	    video_generator/
	    tts/
	  processors/
	  validators/
	  runtime/
	  README.md
	```

	### 14.2. omniff.yaml

	```yaml
	name: omniff-runtime
	version: 0.1

	router:
	  type: encoder_classifier
	  path: models/router

	experts:
	  text_small:
	    type: causal_lm
	    path: models/llm_small

	  text_large:
	    type: causal_lm
	    path: models/llm_large

	  asr:
	    type: speech_to_text
	    path: models/asr

	  vision:
	    type: vision_language
	    path: models/vlm

	  image_edit:
	    type: diffusion_image_edit
	    path: models/image_generator

	  video_edit:
	    type: diffusion_video_edit
	    path: models/video_generator

	  tts:
	    type: text_to_speech
	    path: models/tts
	```

	### 14.3. Hugging Face custom architecture

	For research and demos, a Hugging Face repo with custom code:

	```
	configuration_omniff.py
	modeling_omniff.py
	processing_omniff.py
	config.json
	routing_config.yaml
	```

	Loading:

	```python
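	from omniff import OmniFFRuntime

	# trust_remote_code=True runs the custom Python files shipped in the repo.
	model = OmniFFRuntime.from_pretrained(
	    "stukenov/omniff-runtime",
	    trust_remote_code=True,
	)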
	```

	Production must not depend on loading everything as one `AutoModelForCausalLM`.

	---

	## 15. Cascade Routing

	### 15.1. Principle

	```
	simple → small model
	normal → medium model
	complex → large model
	critical → judge / human review
	```

	This can cut cost by orders of magnitude on real traffic, because the large model is often never invoked.

	Quality depends on router accuracy.

	### 15.2. Escalation flow

	```
	Router selects minimum sufficient model
	→ model executes
	→ validator checks result
	→ on failure: escalate to stronger model or retry with adjusted controls
	→ on repeated failure: mark as failed or route to human review
	```
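
The escalation flow can be sketched as a loop over the model ladder. This is a minimal illustration, not the runtime's implementation: `run_cascade`, `CascadeResult`, and the `run_model` and `validate` callables are hypothetical, and the "retry with adjusted controls" branch is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence


@dataclass
class CascadeResult:
    output: Optional[str]
    model: Optional[str]
    attempts: List[str] = field(default_factory=list)
    needs_human_review: bool = False


def run_cascade(
    prompt: str,
    ladder: Sequence[str],
    start_index: int,
    run_model: Callable[[str, str], str],
    validate: Callable[[str], bool],
) -> CascadeResult:
    """Start at the router's pick and escalate one rung per validation failure."""
    attempts: List[str] = []
    for model in ladder[start_index:]:
        attempts.append(model)
        output = run_model(model, prompt)
        if validate(output):
            return CascadeResult(output, model, attempts)
    # Every remaining rung failed validation: hand the request to a human.
    return CascadeResult(None, None, attempts, needs_human_review=True)


# Ladder taken from section 5.4; the router would supply start_index.
LADDER = ["Qwen3-0.6B", "Qwen3-4B", "Qwen3-14B", "Qwen3-32B"]
```

Because the loop starts at the router's choice instead of the bottom rung, a well-trained router keeps the number of attempts close to one.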

	---

	## 16. Safety and Quality

	### 16.1. Validator-first philosophy

	Every complex graph must have a validator.

	Text validators:

	```
	language, format, JSON schema, citation, risk, factuality
	```

	Image validators:

	```
	prompt adherence, identity preservation, layout preservation,
	NSFW/safety, artifact detection, OCR/text correctness
	```

	Video validators:

	```
	temporal consistency, face preservation, flicker detection,
	motion coherence, audio-video sync, prompt adherence
	```

	### 16.2. Escalation

	On validator failure:

	```
	retry same graph with adjusted controls
	→ or escalate to stronger model
	→ or ask for clarification
	→ or mark as failed
	→ or route to human review
	```

	---

	## 17. Logging and Router Training

	### 17.1. What to log

	```
	request_id, user_id/tenant_id, input modalities, output modality,
	prompt hash, language, task_type, selected_route, selected_models,
	thinking_mode, latency, cost estimate, success/failure,
	validator scores, fallbacks, retries, user feedback, output metadata
	```
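
A log record covering a subset of these fields might be serialized as one JSON line per request. In the sketch below, `RequestLog`, `hash_prompt`, and `to_json_line` are hypothetical names, and SHA-256 is an assumed choice for the prompt hash. The document does not specify a hash function or a log format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class RequestLog:
    request_id: str
    tenant_id: str
    input_modalities: List[str]
    output_modality: str
    prompt_hash: str
    selected_route: str
    selected_models: List[str]
    thinking_mode: str
    latency_ms: float
    success: bool
    validator_scores: Dict[str, float] = field(default_factory=dict)
    retries: int = 0


def hash_prompt(prompt: str) -> str:
    """Store a digest instead of the raw prompt (SHA-256 is an assumption)."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def to_json_line(record: RequestLog) -> str:
    """Serialize one record as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)
```

Logging the hash rather than the prompt text keeps records joinable for deduplication while keeping user content out of the log stream.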

	### 17.2. Router training

	Primary label: cheapest sufficient route — the cheapest model/graph that produced acceptable quality.

	Process:

	```
	1. Collect real and synthetic prompts
	2. Run through different route/model variants
	3. Score outputs with judge model + partial human eval
	4. Assign cheapest-sufficient label
	5. Train encoder-only classifier
	6. Export to ONNX/Candle
	7. Embed in runtime
	8. Continuously retrain on logs
	```
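
Step 4, assigning the cheapest-sufficient label, can be sketched as follows. `Trial`, `cheapest_sufficient`, and the `min_score` threshold are hypothetical, and labeling a prompt `REJECT_OR_HUMAN_REVIEW` when no variant passes is an assumption rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Trial:
    """One route/model variant run on a prompt, with its cost and judged score."""

    route: str
    cost: float
    score: float


def cheapest_sufficient(
    trials: Sequence[Trial],
    min_score: float,
    fallback: str = "REJECT_OR_HUMAN_REVIEW",
) -> str:
    """Return the route of the cheapest trial whose score meets the threshold."""
    passing = [t for t in trials if t.score >= min_score]
    if not passing:
        return fallback
    # Ties on cost are broken in favor of the higher score.
    return min(passing, key=lambda t: (t.cost, -t.score)).route
```

The resulting labels become the classification targets for the encoder-only router, so the threshold directly controls how aggressively the router prefers small models.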

	---

	## 18. Technical Stack

	### 18.1. Runtime core

	Start:

	```
	Python + PyTorch + Transformers + Diffusers + FFmpeg bindings
	```

	Production:

	```
	Rust/Go runtime shell
	Python model workers where needed
	ONNX/Candle/TensorRT acceleration where justified
	```

	### 18.2. No dependency on vLLM/SGLang

	vLLM and SGLang may be optional backends but never the foundation.

	```
	OmniGraph owns routing, planning, graph execution, and scheduling.
	Model backends are replaceable.
	```

	### 18.3. Model backend types

	```
	PyTorch, Transformers, Diffusers, ONNX Runtime,
	Candle, GGUF/llama.cpp-style, custom CUDA kernels,
	external API adapter
	```

	---

	## 19. MVP Roadmap

	### v0.1 — Prove the architecture

	```
	text → text
	image → text
	audio → text → text
	image → image
	```

	Components:

	```
	OmniFF CLI
	OmniFrame / OmniPacket
	OmniGraph / OmniNode
	OmniFFRuntime
	encoder-only router
	ASR module, VLM module, LLM module
	image edit module, validator module
	```

	### v0.2

	```
	text → image
	video → text
	image → video
	document → text
	```

	### v0.3

	```
	video → video
	audio → audio
	document → document
	multi-pass validation
	scheduler
	model hot/warm/cold loading
	```

	### v1.0

	```
	universal graph planner
	Thinking+ controller
	prompt-control side data
	full modality matrix
	validators
	production scheduler
	CLI + SDK + HTTP API
	plugin model interface
	```

	---

	## 20. Product Identity

	### Short positioning

	```
	OmniFF Runtime is an FFmpeg-like multimodal AI processing engine.
	```

	### Extended positioning

	```
	A multimodal graph runtime with encoder-only routing,
	thinking-controlled planning, and pluggable model experts
	for text, speech, vision, image generation, video generation,
	documents, and structured outputs.
	```

	### Kazakh-first variant (OmniFF-KZ)

	```
	OmniFF-KZ is a Kazakh-first multimodal AI runtime that combines
	Qwen expert hierarchy, ASR, vision, image/video generation,
	and document intelligence through a unified graph execution engine
	with native Kazakh language support.
	```

	---

	## 21. What This Must Not Be

	OmniFF Runtime must not be:

	```
	a LangChain pipeline
	a collection of Python scripts
	a wrapper over vLLM
	a chatbot
	a HuggingFace demo
	a ComfyUI clone
	one big safetensors
	a gateway
	a multimodal LLM
	an agent framework
	```

	It must be:

	```
	runtime, format, graph engine, model orchestration layer,
	scheduler, prompt-control system, validator system, CLI/API/SDK
	```

	---

	## 22. Canonical Formula

	```
	input
	→ demux
	→ decode
	→ normalize into OmniFrames
	→ plan with Thinking+
	→ execute graph of AI filters/models
	→ validate
	→ encode
	→ mux
	→ output
	```

	One repo. One config. One processor. One runtime. One CLI. One API. Many experts. One graph executor.

	---

	## 23. Conclusion

	OmniFF Runtime is an infrastructure system of a new class: a multimodal AI runtime that relates to models the way FFmpeg relates to codecs.

	It does not compete with Qwen, Whisper, VLMs, diffusion models, or TTS. It uses them as interchangeable computational nodes.

	Its value is not "one model that does everything." Its value is a unified engineering way to build any transformation:

	```
	text ↔ audio ↔ image ↔ video ↔ document ↔ structured data
	```

	With control:

	```
	prompt, negative prompt, thinking, router, models,
	validators, constraints, schedulers, quality thresholds
	```

	Saken OmniFF Runtime:
	An FFmpeg-like AI runtime for routed, thinking-controlled, multimodal generation and transformation.

	Not "one model." A stronger category: an operating environment for multimodal AI inference and generation.